Abstract

We consider a linear regression model and propose an omnibus test to simultaneously check the assumption of independence between the error and the predictor variables, and the goodness-of-fit of the parametric model. Our approach is based on testing for independence between the residual and the predictor using the recently developed Hilbert-Schmidt independence criterion; see [GFT+08]. The proposed method requires no user-defined regularization, is simple to compute, being based merely on pairwise distances between points in the sample, and is consistent against all alternatives. We develop the distribution theory of the proposed test statistic, under both the null and the alternative hypotheses, and devise a bootstrap scheme to approximate its null distribution. We prove the consistency of the bootstrap procedure. The superior finite sample performance of our procedure is illustrated through a simulation study.

Keywords: Bootstrap, goodness-of-fit test, linear regression, model checking, reproducing kernel Hilbert space, test of independence

1 Introduction 

In regression analysis, given a random vector $(X, Y)$ where $X$ is a $d_0$-dimensional predictor and $Y$ is the one-dimensional response, we want to study the relationship between $Y$ and $X$. The relationship can always be summarized as

$$Y = m(X) + \eta, \tag{1}$$

where $m$ is the regression function and $\eta := Y - m(X)$ is the error, which has conditional mean zero (given $X$). In practice, however, we usually assume that $X \perp\!\!\!\perp \eta$, i.e., $\eta$ is independent of $X$. Moreover, in linear regression, we assume that $m$ belongs to a parametric class, e.g.,

$$\mathcal{M}_\beta := \{ g(x)^T \beta : \beta \in \mathbb{R}^d \}, \tag{2}$$

where $g(x) = (g_1(x), \ldots, g_d(x))^T$ is the vector of known (measurable) predictor functions, and $\beta$ is the finite-dimensional unknown (coefficient) parameter.

In this paper we propose an omnibus test to check simultaneously the assumption of independence between $X$ and $\eta$, and the goodness-of-fit of the linear regression model, i.e., test the null hypothesis

$$H_0 : X \perp\!\!\!\perp \eta, \quad m \in \mathcal{M}_\beta, \tag{3}$$

given i.i.d. data $(X_1, Y_1), \ldots, (X_n, Y_n)$ from the regression model (1). Even when we consider the predictor variables fixed, i.e., we condition on $X$, our procedure can be used to check whether the conditional distribution of $\eta$ given $X$ depends on $X$. This will, in particular, help us detect heteroscedasticity, i.e., departures from the assumption that the conditional variance of $\eta$ given $X$ does not depend on $X$. As far as we are aware, there is no test that can simultaneously check for these two crucial model assumptions in linear regression.

Our procedure is based on testing for the independence of $X$ and the residual (obtained from fitting the parametric model) using the recently developed Hilbert-Schmidt independence criterion (HSIC); see [GFT+08]. Among the virtues of this test is that it is automated (i.e., requires no user-defined regularization), extremely simple to compute, being based merely on the distances between points in the sample, and consistent against all alternatives. Also, compared to other measures of dependence, HSIC does not require any smoothness assumption on the joint distribution of $X$ and $\eta$ (e.g., existence of a density), and its implementation is not computationally intensive in higher dimensions (i.e., when $d_0$ is large). Moreover, this independence testing procedure also yields a novel approach to testing for the goodness-of-fit of the fitted regression model: under model mis-specification, the residuals, although uncorrelated with the predictors (by definition of the least squares procedure), are very much dependent on the predictors, and the HSIC can detect this dependence. Under $H_0$, the test statistic exhibits an $n^{-1}$-rate of convergence, whereas, under dependence, we observe an $n^{-1/2}$-rate of convergence for the centered test statistic.
We find the limiting distribution of the test statistic, under both the null and alternative hypotheses. Interestingly, we see that the asymptotic distribution is very different from that if the true error $\eta$ were observed. To approximate the null distribution of the test statistic, we propose a bootstrap scheme and prove its consistency. Note that the usual permutation test, which is used quite often in testing independence, cannot be directly used in this scenario as we do not observe $\eta$.

Over the last two decades several tests for the goodness-of-fit of a parametric model have been proposed under varied conditions on the distribution of the errors and its dependence on the predictors. [CKWY88] introduced tests of the null hypothesis that a regression function has a particular parametric structure under the assumption of independent homoscedastic normal errors. [AB93] used nonparametric regression to check linear relationships under independent homoscedastic errors; also see [ES90] and [HM93]. In [FH01], some tests are proposed for examining the adequacy of a family of parametric models against large nonparametric alternatives under the assumption of independence and normality of the errors. In [GL05] the authors propose data-driven smooth tests for a parametric regression function. For other relevant work on this topic see [CS10], [Xia09] and the references therein. Any test using a nonparametric regression estimator runs into an ill-posed problem requiring the delicate choice of a number of tuning parameters, e.g., smoothing parameter(s). A few alternative approaches have been developed that circumvent these problems; see, e.g., [Bie90], [Stu97]. However, most of these tests are difficult to implement and do not usually work well when the dimension of $X$ is not low.

Although the independence of error and covariate is a common assumption in regression, there are very few methods available in the literature to test this. However, there has been work on testing for heteroscedasticity between the error and the predictor; see, e.g., [CW83], [BP79], [Ken08] and the references therein. In the nonparametric setup, [EVK08b] and [EVK08a] propose tests for independence but only for univariate covariates. A generalization to the multivariate case was recently considered in [NVK10]; also see [Neu09].

The paper is organized as follows: in Section 2 we introduce the HSIC and 
discuss other measures of dependence. We formulate the problem and state our 
main results in Section 3. A bootstrap procedure to approximate the distribution 
of the test statistic is developed in Section 4. A finite sample study of our method 
along with some well-known competing procedures is presented in Section 5. In 
Section 6 we present a result on triangular arrays of random variables that will 
help us understand the limiting behavior of our test statistic under the null and
alternative hypotheses, and yield the consistency of our bootstrap approach. The 
proofs of the main results are given in Section 7. 

2 Testing independence of two random vectors

In this section we briefly review the HSIC for testing the independence of two random vectors; for a more systematic study of the procedure see [GFT+08], [GBSS05], [SSGF12]. We start with some background and notation. By a reproducing kernel Hilbert space (RKHS) $\mathcal{F}$ of functions on a domain $\mathcal{U}$ with a positive definite kernel $k : \mathcal{U} \times \mathcal{U} \to \mathbb{R}$ we mean a Hilbert space of functions from $\mathcal{U}$ to $\mathbb{R}$ with inner product $\langle \cdot, \cdot \rangle$, satisfying the reproducing property:

$$\langle f(\cdot), k(u, \cdot) \rangle = f(u),$$

for all $f \in \mathcal{F}$ and $u \in \mathcal{U}$. We say that $\mathcal{F}$ is characteristic if and only if the map

$$P \mapsto \int_{\mathcal{U}} k(\cdot, u)\, dP(u)$$

is injective on the space of all Borel probability measures on $\mathcal{U}$ such that $\int_{\mathcal{U}} k(u, u)\, dP(u) < \infty$. Likewise, let $\mathcal{G}$ be a second RKHS on a domain $\mathcal{V}$ with positive definite kernel $l$. Let $P_{uv}$ be a Borel probability measure defined on $\mathcal{U} \times \mathcal{V}$, and let $P_u$ and $P_v$ denote the respective marginal distributions on $\mathcal{U}$ and $\mathcal{V}$. Assuming that

$$E[k(U, U)] < \infty \quad \text{and} \quad E[l(V, V)] < \infty, \tag{4}$$

where $(U, V) \sim P_{uv}$, the HSIC of $P_{uv}$ is defined as

$$\theta(U, V) := E[k(U, U')\, l(V, V')] + E[k(U, U')]\, E[l(V, V')] - 2 E[k(U, U')\, l(V, V'')], \tag{5}$$

where $(U', V'), (U'', V'')$ are i.i.d. copies of $(U, V)$. It is not hard to see that $\theta(U, V) \ge 0$. More importantly, when $\mathcal{F}$ and $\mathcal{G}$ are characteristic (see [Lyo11], [SSGF12]), and (4) holds, then

$$\theta(U, V) = 0 \quad \text{if and only if} \quad P_{uv} = P_u \times P_v.$$

Given an i.i.d. sample $(U_i, V_i)_{1 \le i \le n}$ from $P_{uv}$, we want to test whether $P_{uv}$ factorizes as $P_u \times P_v$. For the purpose of testing independence, we will use a biased empirical estimate of $\theta$ [[GBSS05], Definition 2], obtained by replacing the unbiased U-statistics with the V-statistics

$$\hat\theta_n := \frac{1}{n^2} \sum_{i,j}^{n} k_{ij} l_{ij} + \frac{1}{n^4} \sum_{i,j,q,r}^{n} k_{ij} l_{qr} - \frac{2}{n^3} \sum_{i,j,q}^{n} k_{ij} l_{iq} = \frac{1}{n^2} \operatorname{trace}(KHLH), \tag{6}$$

where the summation indices denote all $t$-tuples drawn with replacement from $\{1, \ldots, n\}$, $t$ being the number of indices below the sum, $k_{ij} := k(U_i, U_j)$, and $l_{ij} := l(V_i, V_j)$, $K$ and $L$ are $n \times n$ matrices with entries $k_{ij}$ and $l_{ij}$, respectively, $H := I - n^{-1} \mathbf{1}\mathbf{1}^T$, and $\mathbf{1}$ is an $n \times 1$ vector of ones. Note that the cost of computing this statistic is $O(n^2)$.
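The two expressions for the biased HSIC estimate in (6) can be checked numerically. The following is a minimal sketch, assuming NumPy and a Gaussian kernel with unit bandwidth; the function names `gaussian_gram`, `hsic_v_trace` and `hsic_v_sums` are illustrative and do not come from the paper.

```python
import numpy as np

def gaussian_gram(z, bandwidth=1.0):
    """Gram matrix of k(u, u') = exp(-||u - u'||^2 / bandwidth^2)."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    sq = np.sum(z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    return np.exp(-np.maximum(d2, 0.0) / bandwidth**2)

def hsic_v_trace(K, L):
    """Biased HSIC estimate written as trace(K H L H) / n^2."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2

def hsic_v_sums(K, L):
    """Same estimate via the three V-statistic sums, using O(n^2) operations."""
    n = K.shape[0]
    term1 = np.sum(K * L) / n**2                    # sum_{i,j} k_ij l_ij
    term2 = K.sum() * L.sum() / n**4                # sum_{i,j,q,r} k_ij l_qr
    term3 = 2.0 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / n**3  # sum_{i,j,q} k_ij l_iq
    return term1 + term2 - term3

rng = np.random.default_rng(0)
u = rng.normal(size=(30, 2))
v = rng.normal(size=30)
K, L = gaussian_gram(u), gaussian_gram(v)
print(hsic_v_trace(K, L), hsic_v_sums(K, L))
```

The trace form is written with dense matrix products for readability; the sum form avoids them and matches the $O(n^2)$ cost noted above.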

Examples of translation invariant characteristic kernel functions on $\mathbb{R}^p$, for $p \ge 1$, include the Gaussian radial basis function kernel $k(u, u') = \exp(-\sigma^{-2} \|u - u'\|^2)$, $\sigma > 0$, the Laplace kernel $k(u, u') = \exp(-\sigma^{-1} \|u - u'\|)$, the inverse multiquadratics $k(u, u') = (\beta + \|u - u'\|^2)^{-\alpha}$, $\alpha, \beta > 0$, etc. We will use the Gaussian kernel in our simulation results.

One can, in principle, use any other test of independence and develop a theory parallel to ours. The choice of the HSIC is motivated by a number of computational and theoretical advantages; see, e.g., [GBSS05] and [GFT+08]. It is worth mentioning in this regard that the recently developed method of distance covariance, introduced by [SRB07] and [SR09], has received much attention in the statistical community. It tackles the problem of testing and measuring dependence between two random vectors in terms of a weighted $L^2$-distance between characteristic functions of the joint distribution of two random vectors and the product of their marginals; see [SSGF12] for a comparative study of the HSIC and the distance covariance methods. However, the kernel induced by the semi-metric used in the distance covariance method (see [SSGF12]) is not smooth and hence is difficult to study theoretically, at least using our techniques.

3 Our test statistic 

We will consider the regression model (1). We denote by $Z = (X, \eta) \sim P$ where $Z \in \mathbb{R}^{d_0} \times \mathbb{R}$ and $E(\eta \mid X) = 0$. Let $P_X$ and $P_\eta$ be the marginal distributions of $X$ and $\eta$ respectively. To start with, we will assume that $m$ does not necessarily belong to $\mathcal{M}_\beta$, as defined in (2). Assuming that $E[g(X)g(X)^T] < \infty$, $E[m(X)^2] < \infty$ and $E[\eta^2] < \infty$, let us define

$$\tilde D^2(\beta) := E[(Y - g(X)^T \beta)^2],$$

for $\beta \in \mathbb{R}^d$. From the definition of $m$, we see that

$$\tilde D^2(\beta) = E[(Y - m(X))^2] + E[(m(X) - g(X)^T \beta)^2].$$

The function $\tilde D^2$ is minimized at $\tilde\beta_0$ if and only if $\tilde\beta_0$ is a minimizer of

$$D^2(\beta) := E[(m(X) - g(X)^T \beta)^2].$$

The quantity $D^2(\tilde\beta_0)$ measures the distance between the true $m$ and the hypothetical model $\mathcal{M}_\beta$. Clearly, if $m(X) = g(X)^T \beta_0$, then $\tilde\beta_0 = \beta_0$. Under the assumption that $E[g(X)g(X)^T]$ is invertible, $D^2(\beta)$ has the unique minimizer

$$\tilde\beta_0 = E[g(X)g(X)^T]^{-1} E[m(X) g(X)].$$

Thus, $g(x)^T \tilde\beta_0$ is the "closest" function (in the least squares sense) to $m(x)$ in $\mathcal{M}_\beta$.
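As a worked illustration of this projection (an example chosen here for concreteness, not taken from the paper), write $\tilde\beta_0$ for the minimizer of $E[(m(X) - g(X)^T\beta)^2]$ and suppose $d_0 = 1$, $X \sim \mathrm{Uniform}(0, 1)$, $m(x) = x^2$ and $g(x) = (1, x)^T$. The closest linear function can then be computed explicitly:

```latex
\begin{align*}
E[g(X)g(X)^T] &= \begin{pmatrix} 1 & 1/2 \\ 1/2 & 1/3 \end{pmatrix}, \qquad
E[m(X)g(X)] = \begin{pmatrix} E[X^2] \\ E[X^3] \end{pmatrix}
             = \begin{pmatrix} 1/3 \\ 1/4 \end{pmatrix}, \\
\tilde\beta_0 &= \begin{pmatrix} 4 & -6 \\ -6 & 12 \end{pmatrix}
                 \begin{pmatrix} 1/3 \\ 1/4 \end{pmatrix}
               = \begin{pmatrix} -1/6 \\ 1 \end{pmatrix}, \\
E\big[(m(X) - g(X)^T\tilde\beta_0)^2\big] &= E\big[(X^2 - X + 1/6)^2\big] = \frac{1}{180}.
\end{align*}
```

In this example the best linear approximation is $x - 1/6$, and the misspecification distance is strictly positive because $x^2$ does not belong to the linear class.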

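Given i.i.d. data $(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ from the regression model (1), we compute the least squares estimator (LSE) in the class $\mathcal{M}_\beta$ as

$$\hat\beta_n := \arg\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^{n} (Y_i - g(X_i)^T \beta)^2. \tag{7}$$

Letting

$$A_n := \frac{1}{n} \sum_{i=1}^{n} g(X_i) g(X_i)^T,$$

note that

$$\hat\beta_n = A_n^{-1} \cdot \frac{1}{n} \sum_{i=1}^{n} g(X_i) Y_i,$$

provided, of course, that $A_n$ is invertible. Let

$$e_i := Y_i - g(X_i)^T \hat\beta_n, \tag{8}$$

for $i = 1, \ldots, n$, be the observed residuals. The test statistic we consider is

$$T_n := \frac{1}{n^2} \sum_{i,j}^{n} k_{ij} l_{ij} + \frac{1}{n^4} \sum_{i,j,q,r}^{n} k_{ij} l_{qr} - \frac{2}{n^3} \sum_{i,j,q}^{n} k_{ij} l_{iq}, \tag{9}$$

where $k_{ij} := k(X_i, X_j)$, and $l_{ij} := l(e_i, e_j)$, with $k$ and $l$ being characteristic kernels defined on $\mathbb{R}^{d_0} \times \mathbb{R}^{d_0}$ and $\mathbb{R} \times \mathbb{R}$ respectively. Note that our test statistic is almost identical to the empirical estimate $\hat\theta_n$ of HSIC between $X$ and $\eta$ described in (6), except for the fact that we replace the unobserved errors $\eta_i$'s by the observed residuals $e_i$'s.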
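The construction of the statistic from least squares residuals can be sketched as follows. This is a minimal sketch, assuming NumPy, Gaussian kernels with unit bandwidth, and a design matrix `G` whose rows are $g(X_i)^T$; the helper names (`gaussian_gram`, `hsic_v`, `residual_hsic`) are hypothetical.

```python
import numpy as np

def gaussian_gram(z, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel exp(-||u - u'||^2 / bandwidth^2)."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    sq = np.sum(z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    return np.exp(-np.maximum(d2, 0.0) / bandwidth**2)

def hsic_v(K, L):
    """V-statistic form of the HSIC estimate (three-sum expression)."""
    n = K.shape[0]
    return (np.sum(K * L) / n**2
            + K.sum() * L.sum() / n**4
            - 2.0 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / n**3)

def residual_hsic(G, X, y, bw_x=1.0, bw_e=1.0):
    """Fit y on the columns of G by least squares, then apply HSIC to (X, residuals)."""
    beta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)   # least squares estimator
    resid = y - G @ beta_hat                           # observed residuals e_i
    T_n = hsic_v(gaussian_gram(X, bw_x), gaussian_gram(resid, bw_e))
    return T_n, beta_hat, resid

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
G = np.column_stack([np.ones(50), X])                  # g(x) = (1, x_1, x_2)
y = G @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=50)
T_n, beta_hat, resid = residual_hsic(G, X, y)
print(50 * T_n)
```

Using `np.linalg.lstsq` rather than inverting the Gram matrix of the design explicitly is a numerical convenience; it returns the same estimator whenever that matrix is invertible.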

For any $u = (u_1, \ldots, u_p) \in \mathbb{R}^p$, we define the $\ell_\infty$-norm of $u$ as $|u|_\infty = \max_{1 \le j \le p} |u_j|$. We will assume throughout the paper that

(I) $A := E[g(X)g(X)^T]$ is invertible.

Moreover, we will always assume the following conditions on the kernels $k$, $l$.

(K) The kernels $k$ and $l$ are characteristic kernels, $k$ is continuous and $l$ is twice continuously differentiable. Denoting the partial derivatives of $l$ as $l_x(x, y) := \partial_x l(x, y)$, $l_{xy}(x, y) := \partial_x \partial_y l(x, y)$, etc., we assume that $l_{xx}$, $l_{xy}$ and $l_{yy}$ are Lipschitz continuous with Lipschitz constant $L$ (w.r.t. the $\ell_\infty$-norm).

We study the behavior of the test statistic $T_n$ under the null hypothesis (3), and also under the following different scenarios:

$$\begin{aligned}
H_1 &: X \not\perp\!\!\!\perp \eta, \quad m \in \mathcal{M}_\beta, \\
H_2 &: X \perp\!\!\!\perp \eta, \quad m \notin \mathcal{M}_\beta, \\
H_3 &: X \not\perp\!\!\!\perp \eta, \quad m \notin \mathcal{M}_\beta.
\end{aligned} \tag{10}$$

To find the limiting distribution of $T_n$ under $H_0$, we will assume the following set of moment conditions on $X$ and $\eta$.

(M) (a) $E[|g(X)|_\infty^2] < \infty$,

(b) $E[\eta^2] < \infty$,

(c) for any $1 \le q, r, s, t \le 4$,

$$E\big[k^2(X_q, X_r)(1 + |g(X_s)|_\infty^2)(1 + |g(X_t)|_\infty^2)\big] < \infty,$$

(d) for any $1 \le q, r \le 2$,

$$E[f^2(\eta_q, \eta_r)] < \infty \quad \text{for } f = l, l_x, l_y, l_{xx}, l_{yy}, l_{xy}.$$

Theorem 3.1 Suppose that conditions (I), (K) and (M) hold. Then, under $H_0$,

$$n T_n \to_d \chi,$$

where $\chi$ has a non-degenerate distribution which depends upon $P = P_X \times P_\eta$ and is denoted by $\chi = \chi(P_X \times P_\eta)$.

Remark 3.1 The random variable $\chi$ can be expressed as a quadratic function of a Gaussian field with a certain nontrivial covariance structure. This is in contrast with the limiting description of degenerate V-statistics where the limiting random variable can be described as a quadratic function of a family of independent Gaussian random variables. The explicit description of $\chi$ is slightly complicated and can be easily recovered from the proof; see (31) in Section A.1.3. However, from a practical point of view, such a description is of little use, since $P$ is unknown to the user.

Remark 3.2 Though one might be tempted to believe that replacing the (unobserved) true errors $\eta_i$ by the residuals $e_i$ should not alter the limiting distribution of the test statistic, it turns out that the two limiting distributions are significantly different; see Figure 1.

Figure 1: Estimated density of $nT_n$ obtained with the true unknown errors (in solid blue) and the estimated residuals (in dashed red) in the linear model $Y = 1 + X + \eta$, where $\eta \sim N(0, \sigma^2 = 0.1)$, $X \sim N(0, 1)$, $X \perp\!\!\!\perp \eta$, and $n = 100$.



Next we will study the limiting behavior of our test statistic $T_n$ under the different alternatives $H_1$, $H_2$ and $H_3$ as in (10). We first introduce the error under model mis-specification as

$$\epsilon := m(X) - g(X)^T \tilde\beta_0 + \eta.$$

Note that when $m \in \mathcal{M}_\beta$, $\epsilon = \eta$. We assume the following set of moment conditions for $H_1$, $H_2$ and $H_3$.

(M') (a) $E[|g(X)|_\infty^2] < \infty$ and $E[m(X)^2] < \infty$,

(b) $E[\eta^2] < \infty$ and $E[|g(X)|_\infty^2 \epsilon^2] < \infty$,

(c) for any $1 \le q, r, s, t \le 4$,

(i) $E[k^2(X_q, X_r)\, l^2(\epsilon_s, \epsilon_t)] < \infty$,

(ii) $E\big[|k(X_q, X_r)|\, |\nabla l(\epsilon_s, \epsilon_t)|_\infty (|g(X_s)|_\infty + |g(X_t)|_\infty)\big] < \infty$,

(iii) $E\big[|k(X_q, X_r)| (|g(X_s)|_\infty^2 + |g(X_t)|_\infty^2)\big] < \infty$,

(iv) $E\big[|k(X_q, X_r)|\, |\mathrm{Hess}(l)(\epsilon_s, \epsilon_t)|_\infty (|g(X_s)|_\infty^2 + |g(X_t)|_\infty^2)\big] < \infty$,

where $|\nabla l(\epsilon_s, \epsilon_t)|_\infty = \max(|l_x(\epsilon_s, \epsilon_t)|, |l_y(\epsilon_s, \epsilon_t)|)$ and $|\mathrm{Hess}(l)(\epsilon_s, \epsilon_t)|_\infty = \max(|l_{xx}(\epsilon_s, \epsilon_t)|, |l_{xy}(\epsilon_s, \epsilon_t)|, |l_{yy}(\epsilon_s, \epsilon_t)|)$. Here $\epsilon_1, \ldots, \epsilon_4$ are i.i.d. copies of $\epsilon$.

Theorem 3.2 Suppose that conditions (I), (K) and (M') hold. Assume further under $H_2$ that $m(X) - g(X)^T \tilde\beta_0$ is not a constant. Then

$$\sqrt{n}\,(T_n - \theta) \to_d N(0, \sigma^2),$$

where $\theta = \theta(X, \epsilon)$ is defined in (5) and $\theta > 0$. The variance $\sigma^2$ depends on the joint distribution of $(X, \epsilon)$ and an expression for it can be found in (18) in the proof of the theorem.

4 Consistency of the bootstrap 

Theorem 3.1 is not very useful in computing the cut-off value to test $H_0$ using the statistic $nT_n$, as the asymptotic distribution $\chi$ involves infinitely many nuisance parameters. An obvious alternative in such a situation is to use a resampling technique to approximate the cut-off. In independence testing problems, a natural choice is to use the permutation test; see, e.g., [SR09], [GFT+08].

However, as we are using the residuals $e_i$'s instead of the true unknown errors $\eta_i$'s in our test statistic, a permutation based test will not work. Indeed, under the null hypothesis, the joint distribution of $(X_i, e_{\pi(i)})_{1 \le i \le n}$ is not invariant under the permutation $\pi$, even though the joint distribution of $(X_i, \eta_{\pi(i)})_{1 \le i \le n}$ remains unchanged under $\pi$.

In this section we show that the bootstrap can be used to consistently approximate the distribution of $nT_n$ under $H_0$. In the following we describe our bootstrap procedure.

1. Let $P_{n,e^0}$ be the empirical distribution of the centered residuals, i.e., of

$$e_i^0 = e_i - \bar e,$$

for $i = 1, \ldots, n$, where the $e_i$'s are defined in (8) and $\bar e = \frac{1}{n} \sum_{i=1}^{n} e_i$. Let $P_{n,X}$ be the empirical distribution of the observed $X_i$'s.

2. Generate an i.i.d. bootstrap sample $(X_{in}^*, \eta_{in}^*)_{1 \le i \le n}$ of size $n$ from the measure

$$P_n := P_{n,X} \times P_{n,e^0}.$$

3. Define

$$Y_{in}^* := g(X_{in}^*)^T \hat\beta_n + \eta_{in}^*,$$

for $i = 1, \ldots, n$, where $\hat\beta_n$ is the LSE obtained in (7). Compute the bootstrapped LSE $\hat\beta_n^*$ using the bootstrap sample $(Y_{in}^*, X_{in}^*)$, $i = 1, \ldots, n$. Also compute the bootstrap residuals

$$e_{in}^* := Y_{in}^* - g(X_{in}^*)^T \hat\beta_n^*,$$

for $i = 1, \ldots, n$.

4. Compute the bootstrap test statistic $T_n^*$, defined as in (9), with $X_i$ replaced by $X_{in}^*$ and $e_i$ replaced by $e_{in}^*$, for $i = 1, \ldots, n$. We approximate the distribution of $nT_n$ by the conditional distribution of $nT_n^*$, given the data.
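The four bootstrap steps above can be sketched in code as follows. This is a minimal sketch under stated assumptions: NumPy, Gaussian kernels with unit bandwidth, a user-supplied function `design` that maps the predictor matrix to the matrix with rows $g(X_i)^T$, and a p-value computed as the fraction of bootstrap statistics at least as large as the observed one. The names `bootstrap_test`, `design` and `n_boot` are hypothetical and not taken from the paper.

```python
import numpy as np

def gaussian_gram(z, bandwidth=1.0):
    """Gram matrix of the Gaussian kernel exp(-||u - u'||^2 / bandwidth^2)."""
    z = np.asarray(z, dtype=float)
    if z.ndim == 1:
        z = z[:, None]
    sq = np.sum(z**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * z @ z.T
    return np.exp(-np.maximum(d2, 0.0) / bandwidth**2)

def hsic_v(K, L):
    """V-statistic form of the HSIC estimate."""
    n = K.shape[0]
    return (np.sum(K * L) / n**2
            + K.sum() * L.sum() / n**4
            - 2.0 * np.sum(K.sum(axis=1) * L.sum(axis=1)) / n**3)

def fit_residuals(G, y):
    """Least squares fit of y on G; returns coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(G, y, rcond=None)
    return beta, y - G @ beta

def bootstrap_test(X, y, design, n_boot=200, seed=0):
    """Return (n*T_n, array of n*T_n^*, bootstrap p-value)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    beta_hat, e = fit_residuals(design(X), y)
    T_n = hsic_v(gaussian_gram(X), gaussian_gram(e))
    e0 = e - e.mean()                                  # step 1: centered residuals
    T_star = np.empty(n_boot)
    for b in range(n_boot):
        Xs = X[rng.integers(0, n, size=n)]             # step 2: X* drawn from P_{n,X}
        etas = e0[rng.integers(0, n, size=n)]          #         eta* drawn independently
        Gs = design(Xs)
        ys = Gs @ beta_hat + etas                      # step 3: bootstrap responses
        _, es = fit_residuals(Gs, ys)                  #         bootstrap residuals
        T_star[b] = hsic_v(gaussian_gram(Xs), gaussian_gram(es))  # step 4
    p_value = float(np.mean(T_star >= T_n))
    return n * T_n, n * T_star, p_value

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))
y = 1.0 + X @ np.array([2.0, -1.0]) + rng.normal(size=60)
linear_design = lambda Z: np.column_stack([np.ones(len(Z)), Z])
stat, boot, p = bootstrap_test(X, y, linear_design, n_boot=100)
print(stat, p)
```

Drawing the predictor indices and the residual indices independently is what makes the bootstrap sample come from the product measure, so the resampled data satisfy the null hypothesis by construction.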

Assume that we have an infinite array of random vectors $Z_1, Z_2, \ldots$ where $Z_i := (X_i, \eta_i)$ are i.i.d. from $P$, defined on some probability space $(\Omega, \mathcal{A}, \mathbb{P})$. We denote by $\mathfrak{Z}$ the entire sequence $(Z_i)_{i \ge 1}$ and write $\mathbb{P}_\omega = \mathbb{P}(\cdot \mid \mathfrak{Z})$ and $E_\omega = E(\cdot \mid \mathfrak{Z})$ to denote conditional probability and conditional expectation, respectively, given $\mathfrak{Z}$.

The following result shows that under $H_0$, the distribution of $nT_n^*$, given the data $(X_i, Y_i)_{1 \le i \le n}$, almost surely converges to the same limiting distribution as that of $nT_n$. Thus the bootstrap is strongly consistent and we can approximate the distribution function of $nT_n$ by $\mathbb{P}_\omega(nT_n^* \le \cdot)$, and use it to find the one-sided cut-off for testing $H_0$. To prove the result, we will need conditions similar to, but slightly stronger than, those stated in (M). Recall that $\epsilon = m(X) - g(X)^T \tilde\beta_0 + \eta$, and set $\epsilon^0 := \epsilon - E[\epsilon]$.

(M'') There exists $\delta > 0$ such that

(a) $E[|g(X)|_\infty^{4+2\delta}] < \infty$ and $E[|m(X)|^{2+\delta}] < \infty$,

(b) $E[|\eta|^{2+\delta}] < \infty$,

(c) for any $1 \le q, r, s, t \le 4$,

$$E\big[|k(X_q, X_r)|^{2+\delta} (1 + |g(X_s)|_\infty^{2+\delta})(1 + |g(X_t)|_\infty^{2+\delta})\big] < \infty,$$

(d) for any $1 \le q, r \le 2$,

$$E[|l(\epsilon_q^0, \epsilon_r^0)|^{2+\delta}] < \infty,$$

$$E\big[(1 + |g(X_q)|_\infty^{2+\delta})\, |f(\epsilon_q^0, \epsilon_r^0)|^{2+\delta}\big] < \infty \quad \text{for } f = l_x, l_y,$$

and

$$E\big[(1 + |g(X_q)|_\infty^{2+\delta} + |g(X_q)|_\infty^{4+2\delta})\, |f(\epsilon_q^0, \epsilon_r^0)|^{2+\delta}\big] < \infty \quad \text{for } f = l_{xx}, l_{yy}, l_{xy}.$$

Theorem 4.1 Suppose that conditions (I), (K) and (M'') hold. Define the random variable $\epsilon := m(X) - g(X)^T \tilde\beta_0 + \eta$, and $\epsilon^0 = \epsilon - E[\epsilon]$, where $(X, \eta) \sim P$. Then

$$nT_n^* \to_d \chi(P_X \times P_{\epsilon^0}),$$

conditional on the observed data a.s., where $\chi$ is given in Theorem 3.1. As a consequence, under $H_0$, $\epsilon^0 = \eta$ and $nT_n^* \to_d \chi(P_X \times P_\eta)$, conditional on the observed data a.s.

Remark 4.1 Note that it follows from Theorem 3.2 that $nT_n \to \infty$ in probability under $H_1$, $H_2$ or $H_3$. But by Theorem 4.1, the quantiles of the conditional distribution of $nT_n^*$ are tight. Hence, the power of our test under $H_1$, $H_2$ or $H_3$ converges to 1 as $n \to \infty$.

Remark 4.2 (Gaussian kernels) A natural choice for $k$ and $l$ is the Gaussian kernel. In this case, we can take $k(u, u') = \exp(-\sigma^{-2} \|u - u'\|^2)$ and $l(v, v') = \exp(-\gamma^{-2} |v - v'|^2)$, where $u, u' \in \mathbb{R}^{d_0}$, $v, v' \in \mathbb{R}$, and $\sigma$ and $\gamma$ are fixed parameters (which can be taken to be 1). Then $k$ and $l$ satisfy condition (K). Since the Gaussian kernels are bounded with all their partial derivatives being bounded, conditions (M.d), (M.c), (M'.c-d) are automatically satisfied for any joint distribution of $(X, \eta)$. Also, condition (M.c) is implied by the simpler condition $E[|g(X)|_\infty^4] < \infty$.

5 Simulation study 

In this section we investigate the finite sample performance of the proposed testing procedure based on $T_n$, as defined in (9), in two different scenarios: (a) testing for the independence of the error $\eta$ and the predictor $X$, as in (1), when the regression model is well-specified; (b) testing for the goodness-of-fit of the parametric regression model when the independence of $\eta$ and $X$ is assumed. As we have discussed in the Introduction, there are very few methods available to test (a), and hardly any when $d_0 \ge 2$. For the goodness-of-fit of the parametric regression model there has been quite a lot of work in the statistical literature but we only consider a selected few procedures for comparison.

We consider two data generating models. Model 1 is adapted from [SMQ98] (see Model 3 of their paper) and can be expressed as

$$Y = 2 + 5X_1 - X_2 + aX_1X_2 + \eta,$$

with covariate $X = (X_1, \ldots, X_{d_0})^T$, where $X_1, \ldots, X_{d_0}$ are distributed independently as Uniform(0, 1), and $\eta$ is drawn from a normal distribution with mean 0. [SMQ98] used $d_0 = 2$ in their simulations but we use $d_0 = 4$.
$\lambda$        0    5   10   15   20   25   50
n = 100    5   15   26   31   32   36   44
n = 200    5   40   68   74   78   81   88

Table 1: Percentage of times Model 1 was rejected, $\alpha = 0.05$.



$\lambda$        0    5   10   15   20   25   50
n = 100    6   27   30   34   34   34   37
n = 200    5   49   63   69   71   71   75

Table 2: Percentage of times Model 2 was rejected, $\alpha = 0.05$.

The other model, Model 2, is adapted from [FH01] (see Example 4 of their paper) and can be written as

$$Y = X_1 + aX_2^2 + 2X_4 + \eta,$$

where $X = (X_1, X_2, X_3, X_4)^T$ is the covariate vector. The covariates $X_1, X_2, X_3$ are normally distributed with mean 0 and variance 1 and pairwise correlation 0.5. The predictor $X_4$ is binary with probability of "success" 0.4 and independent of $X_1$, $X_2$ and $X_3$. Random samples of size $n$, with $n = 100$ and $n = 200$, are drawn from Model 1 (and also from Model 2) and a multiple linear regression model is fitted to the samples, without the interaction term $X_1X_2$ (respectively, the $X_2^2$ term). Thus, these models are well-specified if and only if $a = 0$. In all the calculations, whenever required, we use 1000 bootstrap samples to estimate the cut-off values for the tests. To implement our procedure we take Gaussian kernels with fixed bandwidths. To make our procedure invariant under linear transformations we work with standardized variables.
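The simulation design for Model 1 can be sketched as follows: draw the covariates, generate the response with interaction coefficient `a`, and fit the linear model that omits the interaction term. This is a minimal sketch assuming NumPy; the noise standard deviation `noise_sd` is an assumption made here for illustration (the text only states that the error is normal with mean 0), and the function names are hypothetical.

```python
import numpy as np

def simulate_model1(n, a, d0=4, noise_sd=1.0, rng=None):
    """Draw (X, y) from Y = 2 + 5 X1 - X2 + a X1 X2 + eta with Uniform(0, 1) covariates."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, size=(n, d0))
    eta = rng.normal(0.0, noise_sd, size=n)   # noise_sd is an assumed value
    y = 2.0 + 5.0 * X[:, 0] - X[:, 1] + a * X[:, 0] * X[:, 1] + eta
    return X, y

def fit_main_effects(X, y):
    """Least squares fit with intercept and main effects only (no interaction term)."""
    G = np.column_stack([np.ones(len(y)), X])
    beta_hat, *_ = np.linalg.lstsq(G, y, rcond=None)
    return beta_hat, y - G @ beta_hat

X, y = simulate_model1(n=100, a=5.0, rng=0)
beta_hat, resid = fit_main_effects(X, y)
print(beta_hat)
```

With `a = 0` the fitted model is well-specified; with `a` nonzero the residuals remain uncorrelated with the fitted covariates but are no longer independent of them, which is the departure the test is meant to detect.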

5.1 Testing for the independence 

We consider the above two models with $a = 0$ and the following error structure:

,1*^(0™), 

where $\lambda = 0, 5, 10, 15, 20, 25, 50$. Table 1 gives the percentage of times Model 1 was rejected as the sample size $n$ and $\lambda$ vary, when $\alpha = 0.05$. Table 2 gives the same results for Model 2. As expected, the power of the test increases with $\lambda$ and $n$.
a        0    1    2    3    4    5    7   10
T_n      4    6   11   20   34   57   89  100
S_1      5    5    7    8   11   16   28   48
S_2      3    5    7   10   14   21   37   60
F        7    7    7   10   16   22   46   89

Table 3: Percentage of times Model 1 was rejected when $\alpha = 0.05$ and n = 100.



a        0  0.05  0.10  0.15  0.20  0.25  0.30  0.35  0.40  0.50  0.60
T_n      6     7    10    14    22    31    43    57    69    81    92
S_1      5     5     6     8    13    15    25    31    41    51    69
S_2      5     5     8    10    16    20    31    38    49    55    68
F        8    10     8     9    10    16    23    31    42    64    84

Table 4: Percentage of times Model 2 was rejected when $\alpha = 0.05$ and n = 100.



5.2 Goodness-of-fit test for parametric regression 

Under the assumption of independence of $X$ and $\eta$, our procedure can be used to test the goodness-of-fit of the fitted parametric model. In our simulation study we compare the performance of our method with that of [SMQ98] and [FH01]. [FH01] proposed a lack-of-fit test based on Fourier transforms under the assumption of i.i.d. Gaussian errors; also see [CS10] for a very similar method. The main drawback of this approach is that the method needs a reliable estimator of $\sigma^2$ (the variance of $\eta$) to compute the test statistic, and it can be very difficult to obtain such an estimator under model mis-specification. We present the power study of the adaptive Neyman test (see equation (2.1) of [FH01]) using the known $\sigma^2$ (as a gold standard). We denote this test statistic by $F$. Note that when using an estimate of $\sigma^2$, as in equation (2.10) of [FH01], we got very poor results. [SMQ98] use the empirical process of the regressors marked by the residuals and use bootstrap approximations to find the cut-off of the test statistic. We implement this method using the IntRegGOF package in R. We denote the two variant test statistics, the Kolmogorov-Smirnov type and the Cramér-von Mises type, by $S_1$ and $S_2$, respectively. From Tables 3 and 4 it is clear that our procedure has much better finite sample performance than the competing methods. Note that as $a$ increases, the power of our test increases monotonically towards 1. It even performs better than $F$, which uses the known $\sigma^2$.
5.3 Real data analysis

We consider two well-known regression data sets and illustrate the performance of our procedure on them. The first data set concerns the relation between the atmospheric ozone level and a variety of atmospheric pollutants (e.g., nitrogen dioxide, carbon dioxide, sulphur dioxide, etc.); it contains 8 predictors and is studied in equation (2) of [Xia09]. For a complete background on the data set see the reports of the World Health Organization (2003), Bonn, Germany. As illustrated in [Xia09], the data set certainly exhibits a non-linear trend. However, neither [SMQ98] nor [FH01] rejects the linear model specification at the 5% significance level, which suggests that their methods have low power with multiple regressors. Our procedure yields a p-value of 0.02.

The other example we consider is the savings data set given in [Far05] (see Chapter 3, page 31). This data set consists of some economic measurements collected for 50 countries and has 4 predictors. It is used in the book as an illustration of the various inferential techniques using multiple linear regression. For this data set our method, along with those of [SMQ98] and [FH01], does not reject the null hypothesis (3). We observe a p-value of 0.2 for our method. Thus, there is a natural agreement among the competing procedures, as can be expected from a data set used to demonstrate multiple linear regression.



6 A convergence result 

In this section we present a result on triangular arrays of random variables that will help us understand the limit behavior of $T_n$ under the null and alternative hypotheses and yield the consistency of our bootstrap procedure.

We denote by $Z = (X, \epsilon) \sim P$ on $\mathbb{R}^{d_0} \times \mathbb{R}$. For each $n \ge 1$, we will consider a triangular array of random vectors $Z_{in} := (X_{in}, \epsilon_{in})$ for $i = 1, \ldots, n$, i.i.d. from a distribution $P_n$ on $\mathbb{R}^{d_0} \times \mathbb{R}$ and a real vector $\beta_n$, and define

$$Y_{in} := g(X_{in})^T \beta_n + \epsilon_{in}$$

for $i = 1, 2, \ldots, n$. We may assume that the random vectors $Z$, $(Z_{in})_{1 \le i \le n < \infty}$ are all defined on a common probability space and we denote by $\mathbb{P}$ and $E$ the probability measure and the corresponding expectation operator on that probability space.

We compute an estimator of $\beta_n$, to be denoted by $\hat\beta_n^*$, from the given data using the method of least squares, i.e.,

$$\hat\beta_n^* = \arg\min_{\beta \in \mathbb{R}^d} \sum_{i=1}^{n} (Y_{in} - g(X_{in})^T \beta)^2 = A_n^{-1} \cdot \frac{1}{n} \sum_{i=1}^{n} g(X_{in}) Y_{in},$$

where $A_n := \frac{1}{n} \sum_{i=1}^{n} g(X_{in}) g(X_{in})^T$ is assumed to be invertible. We write

$$e_{in} = Y_{in} - g(X_{in})^T \hat\beta_n^*$$

for the $i$-th residual (at stage $n$). We want to find the limit distribution of the test statistic

$$T_n^* := \frac{1}{n^2} \sum_{i,j}^{n} k_{ij} l_{ij} + \frac{1}{n^4} \sum_{i,j,q,r}^{n} k_{ij} l_{qr} - \frac{2}{n^3} \sum_{i,j,q}^{n} k_{ij} l_{iq},$$

where $k : \mathbb{R}^{d_0} \times \mathbb{R}^{d_0} \to \mathbb{R}$, $l : \mathbb{R} \times \mathbb{R} \to \mathbb{R}$ are kernels, $k_{ij} = k(X_{in}, X_{jn})$, and $l_{ij} = l(e_{in}, e_{jn})$. We make the following assumptions to study the limiting behavior of $T_n^*$.

(C1) On measures $P_n$. We will assume the following conditions.

(a) $X_{in}$ and $\epsilon_{in}$ are independent. In other words, $P_n = P_{n,X} \times P_{n,\epsilon}$ for all $n$, where $P_{n,X}$ is a measure on $\mathbb{R}^{d_0}$ and $P_{n,\epsilon}$ is a measure on $\mathbb{R}$.

(b) $E[\epsilon_{1n}] = 0$ for all $n$.

(c) There exists a distribution $P = P_X \times P_\epsilon$ on $\mathbb{R}^{d_0} \times \mathbb{R}$ such that $P_n \to_d P$.

(d) $(X_{1n}, g(X_{1n})) \to_d (X, g(X))$, where $X \sim P_X$.

(C2) Uniform integrability conditions. The following families of random variables are uniformly integrable for any $1 \le p, q, r, s \le 4$,

(a) $\{|g(X_{pn})|_\infty^2 : n \ge 1\}$,

(b) $\{|\epsilon_{pn}|^2 : n \ge 1\}$,

(c) $\{k^2(X_{pn}, X_{qn})(1 + |g(X_{rn})|_\infty^2)(1 + |g(X_{sn})|_\infty^2) : n \ge 1\}$, and

(d) $\{f^2(\epsilon_{pn}, \epsilon_{qn}) : n \ge 1\}$ for $f = l, l_x, l_y, l_{xx}, l_{yy}, l_{xy}$.

Theorem 6.1 Suppose that conditions (I), (K), (C1) and (C2) hold. Then $nT_n^* \to_d \chi$, where $\chi$ has a non-degenerate distribution which depends upon $P = P_X \times P_\epsilon$ and is denoted by $\chi = \chi(P_X \times P_\epsilon)$ (as in Theorem 3.1).

The proof of Theorem 6.1 is relegated to the Appendix.

7 Proofs of the main results 
7.1 Proof of the Theorem 3.1 

The proof is an easy consequence of Theorem 6.1, by taking $P_n = P$ for all $n$. Under $H_0$, $P$ is in the product form $P_X \times P_\eta$, which implies (C1.a). The conditions (C1.b-d) are also trivially satisfied. Moreover, condition (C2) is immediate from assumption (M). □



7.2 Proof of the Theorem 4.1 

We will apply Theorem 6.1 to derive the desired result by checking that assumptions (C1) and (C2) hold for each $\omega \in \Omega$, outside a set of measure zero. Note that we will apply Theorem 6.1 conditional on $\mathfrak{Z}$, and thus the probability and expectation operators in Theorem 6.1 are now $\mathbb{P}_\omega$ and $E_\omega$, respectively. We will apply the theorem with $\epsilon_{in} = \eta_{in}^*$, $X_{in} = X_{in}^*$, for all $1 \le i \le n$, and with (random) measures $P_n = P_{n,X} \times P_{n,e^0}$ where

$$P_{n,X} = \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}, \quad \text{and} \quad P_{n,e^0} = \frac{1}{n} \sum_{i=1}^{n} \delta_{e_i^0}.$$

Let

$$\epsilon_i := m(X_i) - g(X_i)^T \tilde\beta_0 + \eta_i, \tag{11}$$

for $i = 1, \ldots, n$. Then $\epsilon_1, \epsilon_2, \ldots, \epsilon_n$ are i.i.d.; let $P_{\epsilon^0}$ be the distribution of $\epsilon_1^0 := \epsilon_1 - E[\epsilon_1]$.

Let us start by verifying (C1). By definition, $P_n = P_{n,X} \times P_{n,e^0}$ is a product measure. We take $P = P_X \times P_{\epsilon^0}$, where $P_X$ (resp. $P_{\epsilon^0}$) is the common distribution of $X_i$ (resp. $\epsilon_i^0$). By Lemma A.2(ii), $P_{n,e^0} \to_d P_{\epsilon^0}$ almost surely. An application of the Glivenko-Cantelli theorem yields $P_{n,X} \to_d P_X$ almost surely. Similarly, we have $(X_{1n}^*, g(X_{1n}^*)) \to_d (X, g(X))$ a.s. Also, $E[\epsilon_{1n}] = E_\omega[\eta_{1n}^*] = \mathbb{P}_n[e - \bar e] = 0$.

We will now show that (C2) holds. First note that $E_\omega[|\eta_{1n}^*|^{2+\delta}] = \mathbb{P}_n[|e^0|^{2+\delta}]$ is bounded by a constant (depending on $\omega$) by Lemma A.2(iii). This shows (C2.b). To see that (C2.a) holds, observe that by assumption (M''.a) and the SLLN, almost surely

$$E_\omega[|g(X_{1n}^*)|_\infty^{2+\delta}] = \mathbb{P}_n[|g(X)|_\infty^{2+\delta}] \to E[|g(X)|_\infty^{2+\delta}] < \infty.$$

To verify (C2.c), notice that the quantity of interest is a V-statistic. The SLLN for U-statistics along with the condition (M''.c) imply that (C2.c) holds.

It remains to check (C2.d). Throughout the rest of the proof, we will use the notation $a_n \lesssim b_n$ for two positive sequences of real numbers $a_n$ and $b_n$ to mean that $a_n \le C b_n$ for all $n$, for some constant $C$. Consider $f = l_{xx}$, $l_{xy}$ or $l_{yy}$. Then, for $q \ne r$,
$$\mathbb{E}_\omega[|f(\eta^*_{qn}, \eta^*_{rn})|^{2+\delta}] = \frac{1}{n^2}\sum_{i,j=1}^n |f(e_i^o, e_j^o)|^{2+\delta},$$
that can be bounded by
$$\frac{2^{2+\delta}}{n^2}\sum_{i,j=1}^n \left[ |f(e_i^o, e_j^o) - f(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} + |f(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} \right] \lesssim \mathbb{P}_n[|e^o - \epsilon^o|^{2+\delta}] + \frac{1}{n^2}\sum_{i,j=1}^n |f(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} = O_\omega(1). \qquad (12)$$
In the first inequality above, we have used the Lipschitz continuity of $f$. Note that by the SLLN of V-statistics, $n^{-2}\sum_{i,j=1}^n |f(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} \xrightarrow{a.s.} \mathbb{E}[|f(\epsilon_1^o, \epsilon_2^o)|^{2+\delta}]$, which holds under the moment condition $\mathbb{E}[|f(\epsilon_1^o, \epsilon_2^o)|^{2+\delta}] < \infty$ for $f = l_{xx}$, $l_{xy}$ or $l_{yy}$ from (M''.d). This fact along with Lemma A.2(i) justifies the equality in (12).

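A similar analysis can be done for $q = r$. Indeed, $\mathbb{E}_\omega[|f(\eta^*_{qn}, \eta^*_{qn})|^{2+\delta}] = n^{-1}\sum_{i=1}^n |f(e_i^o, e_i^o)|^{2+\delta}$ is bounded by
$$\frac{2^{2+\delta}}{n}\sum_{i=1}^n \left[ |f(e_i^o, e_i^o) - f(\epsilon_i^o, \epsilon_i^o)|^{2+\delta} + |f(\epsilon_i^o, \epsilon_i^o)|^{2+\delta} \right] \lesssim \mathbb{P}_n[|e^o - \epsilon^o|^{2+\delta}] + \frac{1}{n}\sum_{i=1}^n |f(\epsilon_i^o, \epsilon_i^o)|^{2+\delta} = O_\omega(1).$$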

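Now consider $f = l_x$ or $l_y$. For $1 \le i \le n$, let $a_i := |e_i^o - \epsilon_i^o|$. Consider the following upper bound for $|f(e_i^o, e_j^o)|$, which uses a one-term Taylor expansion for $f$ and the Lipschitz continuity of the partial derivatives $f_x$ and $f_y$:
$$|f(e_i^o, e_j^o)| \le |f(\epsilon_i^o, \epsilon_j^o)| + a_i |f_x(\epsilon_i^o, \epsilon_j^o)| + a_j |f_y(\epsilon_i^o, \epsilon_j^o)| + 2L(a_i^2 + a_j^2). \qquad (13)$$
Consequently, if $q \ne r$, $\mathbb{E}_\omega[|f(\eta^*_{qn}, \eta^*_{rn})|^{2+\delta}]$ is bounded by the following, up to a constant,
$$\frac{1}{n^2}\sum_{i,j} |f(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} + \frac{1}{n^2}\sum_{i,j} \left( a_i^{2+\delta} |f_x(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} + a_j^{2+\delta} |f_y(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} \right) + \frac{1}{n}\sum_{i} a_i^{4+2\delta}.$$
The first and the third term are $O_\omega(1)$ by (M''.d) and Lemma A.2(i). Note that
$$a_i \le d\,|\hat\beta_n - \beta_0|_\infty |g(X_i)|_\infty + |\bar e - \mathbb{E}[\epsilon]| = O_\omega(1)\,(1 + |g(X_i)|_\infty).$$
Therefore,
$$\frac{1}{n^2}\sum_{i,j} a_i^{2+\delta} |f_x(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} \le O_\omega(1)\, \frac{1}{n^2}\sum_{i,j} (1 + |g(X_i)|_\infty)^{2+\delta} |f_x(\epsilon_i^o, \epsilon_j^o)|^{2+\delta},$$
which is again $O_\omega(1)$ by the SLLN of V-statistics, which holds under (M''.d). Similarly,
$$\frac{1}{n^2}\sum_{i,j} a_j^{2+\delta} |f_y(\epsilon_i^o, \epsilon_j^o)|^{2+\delta} = O_\omega(1).$$
Putting these together, we obtain that
$$\mathbb{E}_\omega[|f(\eta^*_{qn}, \eta^*_{rn})|^{2+\delta}] = O_\omega(1) \quad \text{for } q \ne r.$$
A similar analysis shows that $\mathbb{E}_\omega[|f(\eta^*_{qn}, \eta^*_{qn})|^{2+\delta}] = O_\omega(1)$.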

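For $f = l$, we can closely imitate the above argument for $f = l_x$ or $l_y$ to deduce that $\mathbb{E}_\omega[|f(\eta^*_{qn}, \eta^*_{rn})|^{2+\delta}] = O_\omega(1)$ for any $1 \le q, r \le 2$. We just need to replace the inequality (13) with the following inequality, which now follows from the two-term Taylor expansion of the function $l$:
$$|l(e_i^o, e_j^o)| \le |l(\epsilon_i^o, \epsilon_j^o)| + a_i |l_x(\epsilon_i^o, \epsilon_j^o)| + a_j |l_y(\epsilon_i^o, \epsilon_j^o)| + \tfrac12 a_i^2 |l_{xx}(\epsilon_i^o, \epsilon_j^o)| + \tfrac12 a_j^2 |l_{yy}(\epsilon_i^o, \epsilon_j^o)| + a_i a_j |l_{xy}(\epsilon_i^o, \epsilon_j^o)| + 4L(a_i^3 + a_j^3).$$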

We omit the details. Thus assumption (C2.d) of Theorem 6.1 holds. This concludes 
the proof of the Theorem 4.1. □ 
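The resampling scheme analysed in this proof draws $X^*_{in}$ from the empirical distribution of the predictors and $\eta^*_{in}$ independently from the empirical distribution of the centered residuals. The following is a minimal sketch of that scheme for a simple linear model, under several assumptions that are not taken from the paper: $g(x) = (1, x)$, Gaussian kernels with unit bandwidth for both $k$ and $l$, a plain proportion as the p-value, and all function names (`gram`, `hsic_v_stat`, `fit_line`, `bootstrap_test`) are hypothetical.

```python
import math
import random

def gram(values, bandwidth=1.0):
    """Gaussian Gram matrix for scalar data (the kernel choice is an assumption)."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in values]
            for a in values]

def hsic_v_stat(x, r):
    """T_n = n^-2 sum k_ij l_ij + n^-4 sum k_ij sum l_qr - 2 n^-3 sum k_ij l_iq."""
    n = len(x)
    K, L = gram(x), gram(r)
    row_k = [sum(row) for row in K]
    row_l = [sum(row) for row in L]
    t1 = sum(K[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2
    t2 = sum(row_k) * sum(row_l) / n ** 4
    t3 = 2 * sum(row_k[i] * row_l[i] for i in range(n)) / n ** 3
    return t1 + t2 - t3

def fit_line(x, y):
    """Least squares fit of y on g(x) = (1, x); returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx
    return my - slope * mx, slope

def residuals(x, y, beta):
    return [b - beta[0] - beta[1] * a for a, b in zip(x, y)]

def bootstrap_test(x, y, n_boot=50, seed=0):
    rng = random.Random(seed)
    n = len(x)
    beta_hat = fit_line(x, y)
    e = residuals(x, y, beta_hat)
    e_bar = sum(e) / n
    e_centered = [v - e_bar for v in e]
    t_obs = n * hsic_v_stat(x, e)
    t_boot = []
    for _ in range(n_boot):
        # X* and eta* are drawn independently, so the bootstrap sample
        # comes from the product measure P_{n,X} x P_{n,e^o}.
        x_star = [rng.choice(x) for _ in range(n)]
        eta_star = [rng.choice(e_centered) for _ in range(n)]
        y_star = [beta_hat[0] + beta_hat[1] * a + b
                  for a, b in zip(x_star, eta_star)]
        beta_star = fit_line(x_star, y_star)
        e_star = residuals(x_star, y_star, beta_star)
        t_boot.append(n * hsic_v_stat(x_star, e_star))
    p_value = sum(t >= t_obs for t in t_boot) / n_boot
    return t_obs, t_boot, p_value

# Illustrative data from a correctly specified model with independent errors.
data_rng = random.Random(1)
x = [data_rng.uniform(-1, 1) for _ in range(40)]
y = [1 + 2 * a + data_rng.gauss(0, 0.5) for a in x]
t_obs, t_boot, p_value = bootstrap_test(x, y)
```

The bootstrap replicates `t_boot` play the role of draws of $nT_n^*$, and `t_obs` is compared against their upper quantiles.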



7.3 Proof of Theorem 3.2 

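Let $\epsilon_i$ be as defined in (11). Note that the LSE $\hat\beta_n$ as defined in (7) admits the following expansion around $\beta_0$:
$$\sqrt{n}(\hat\beta_n - \beta_0) = \sqrt{n}\, A_n^{-1}\, \frac{1}{n}\sum_{i=1}^n g(X_i)\,(Y_i - g(X_i)^\top \beta_0) = (I + o_P(1))\, n^{-1/2}\sum_{i=1}^n A^{-1} g(X_i)\,\epsilon_i, \qquad (14)$$
where in the last step we have used the fact that $A_n \xrightarrow{P} A$, which holds as $\mathbb{E}[|g(X)|_\infty^2] < \infty$. By (14), $\hat\beta_n$ admits the following expansion around $\beta_0$:
$$\sqrt{n}(\hat\beta_n - \beta_0) = (I + o_P(1))\, n^{-1/2}\sum_{i=1}^n A^{-1} g(X_i)\,\epsilon_i.$$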

The normal equation for the regression model yields
$$\mathbb{E}[g(X)(m(X) - g(X)^\top \beta_0)] = 0.$$
Also, $\mathbb{E}[g(X)\eta] = \mathbb{E}[g(X)\,\mathbb{E}[\eta \mid X]] = 0$. Hence, we have $\mathbb{E}[g(X)\epsilon] = 0$. Moreover, by (M'.b), the covariance matrix $A^{-1}\mathbb{E}[g(X)g(X)^\top \epsilon^2]A^{-1}$ exists. So, by the central limit theorem (CLT), $\sqrt{n}(\hat\beta_n - \beta_0)$ converges in distribution to a Gaussian random vector with mean $0$ and covariance $A^{-1}\mathbb{E}[g(X)g(X)^\top \epsilon^2]A^{-1}$.
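As an illustration of the limiting covariance $A^{-1}\mathbb{E}[g(X)g(X)^\top \epsilon^2]A^{-1}$, the sketch below computes the LSE and a plug-in version of this sandwich matrix for $g(x) = (1, x)$. Replacing $\epsilon_i$ by the residuals and expectations by sample averages is an assumption made here for illustration, and the function name `lse_and_sandwich` is hypothetical.

```python
import random

def lse_and_sandwich(x, y):
    """LSE for g(x) = (1, x) and the plug-in matrix A_n^{-1} M_n A_n^{-1}."""
    n = len(x)
    # A_n = (1/n) sum g(X_i) g(X_i)^T
    a12 = sum(x) / n
    a22 = sum(v * v for v in x) / n
    det = a22 - a12 * a12
    a_inv = [[a22 / det, -a12 / det], [-a12 / det, 1.0 / det]]
    # b_n = (1/n) sum g(X_i) Y_i, so that beta_hat = A_n^{-1} b_n
    b = [sum(y) / n, sum(u * v for u, v in zip(x, y)) / n]
    beta = [a_inv[0][0] * b[0] + a_inv[0][1] * b[1],
            a_inv[1][0] * b[0] + a_inv[1][1] * b[1]]
    e = [v - beta[0] - beta[1] * u for u, v in zip(x, y)]
    # M_n = (1/n) sum g(X_i) g(X_i)^T e_i^2 (residuals stand in for the errors)
    m11 = sum(r * r for r in e) / n
    m12 = sum(u * r * r for u, r in zip(x, e)) / n
    m22 = sum(u * u * r * r for u, r in zip(x, e)) / n
    m = [[m11, m12], [m12, m22]]

    def matmul(p, q):
        return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
                for i in range(2)]

    return beta, matmul(matmul(a_inv, m), a_inv)

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(2000)]
y = [1 + 2 * a + rng.gauss(0, 0.5) for a in x]
beta_hat, cov_hat = lse_and_sandwich(x, y)
```

Dividing `cov_hat` by $n$ gives an approximate covariance for $\hat\beta_n$ itself.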
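We expand $\hat l_{ij} := l(e_i, e_j)$ around $l_{ij} := l(\epsilon_i, \epsilon_j)$ using Taylor's theorem as
$$\hat l_{ij} = l_{ij} + (e_i - \epsilon_i)\, l_x(\gamma_{ijn}, \tau_{ijn}) + (e_j - \epsilon_j)\, l_y(\gamma_{ijn}, \tau_{ijn})$$
for some point $(\gamma_{ijn}, \tau_{ijn})$ on the line joining $(\epsilon_i, \epsilon_j)$ and $(e_i, e_j)$. Using
$$e_i - \epsilon_i = -(\hat\beta_n - \beta_0)^\top g(X_i), \qquad (15)$$
we can decompose $T_n$ in the following way,
$$T_n = T_n^{(0)} + (\hat\beta_n - \beta_0)^\top T_n^{(1)} + R_n,$$
where
$$T_n^{(p)} = \frac{1}{n^2}\sum_{i,j} k_{ij}\, l_{ij}^{(p)} + \frac{1}{n^4}\sum_{i,j,q,r} k_{ij}\, l_{qr}^{(p)} - \frac{2}{n^3}\sum_{i,j,q} k_{ij}\, l_{iq}^{(p)}$$
for $p = 0, 1$ and
$$l_{ij}^{(0)} = l(\epsilon_i, \epsilon_j), \qquad l_{ij}^{(1)} = -\left[ l_x(\epsilon_i, \epsilon_j)\, g(X_i) + l_y(\epsilon_i, \epsilon_j)\, g(X_j) \right].$$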

We will first show the negligibility of the remainder term $R_n$. More precisely, we claim that $\sqrt{n}\,R_n \xrightarrow{P} 0$. To prove the claim we need the following elementary lemma, which we state without proof.

Lemma 7.1 Let $f : \mathbb{R}^2 \to \mathbb{R}$ be a continuously differentiable function with its partial derivatives $f_x, f_y$ being Lipschitz continuous with Lipschitz constant $L$ (w.r.t. the $\ell_\infty$ norm). Then for any $u, v \in \mathbb{R}^2$,
$$|f(v) - f(u)| \le 2\,|\nabla f(u)|_\infty\, |u - v|_\infty + 2L\,|u - v|_\infty^2.$$
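The inequality in Lemma 7.1 can be checked numerically on a concrete function. The sketch below uses $f(x, y) = \sin(x)\cos(y)$, an illustrative choice not taken from the paper, whose partial derivatives are Lipschitz in the sup norm with constant $L = 1$; the helper names are hypothetical.

```python
import math
import random

def f(x, y):
    # Illustrative test function; f_x and f_y are Lipschitz with L = 1.
    return math.sin(x) * math.cos(y)

def grad_sup_norm(x, y):
    """Sup norm of the gradient of f at (x, y)."""
    fx = math.cos(x) * math.cos(y)
    fy = -math.sin(x) * math.sin(y)
    return max(abs(fx), abs(fy))

def lemma_bound_holds(u, v, lipschitz=1.0):
    """Check |f(v) - f(u)| <= 2|grad f(u)| |u - v| + 2L |u - v|^2 in sup norm."""
    dist = max(abs(u[0] - v[0]), abs(u[1] - v[1]))
    lhs = abs(f(*v) - f(*u))
    rhs = 2 * grad_sup_norm(*u) * dist + 2 * lipschitz * dist ** 2
    return lhs <= rhs + 1e-12  # small tolerance for floating point error

rng = random.Random(0)
pairs = [((rng.uniform(-5, 5), rng.uniform(-5, 5)),
          (rng.uniform(-5, 5), rng.uniform(-5, 5))) for _ in range(10000)]
all_hold = all(lemma_bound_holds(u, v) for u, v in pairs)
```

A random search of this kind cannot prove the lemma, but it is a quick sanity check on the constants $2$ and $2L$.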
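An application of the above lemma together with (15) gives
$$|l_x(\gamma_{ijn}, \tau_{ijn}) - l_x(\epsilon_i, \epsilon_j)| \lesssim |\nabla l_x(\epsilon_i, \epsilon_j)|_\infty\, |\hat\beta_n - \beta_0|_\infty\, (|g(X_i)|_\infty + |g(X_j)|_\infty) + |\hat\beta_n - \beta_0|_\infty^2\, (|g(X_i)|_\infty + |g(X_j)|_\infty)^2.$$
Similarly, we can bound $|l_y(\gamma_{ijn}, \tau_{ijn}) - l_y(\epsilon_i, \epsilon_j)|$. Finally, we can bound $\sqrt{n}\,|R_n|$, up to a constant, by
$$\sqrt{n}\,|\hat\beta_n - \beta_0|_\infty^2\, T_n^{(2)} + \sqrt{n}\,|\hat\beta_n - \beta_0|_\infty^3\, T_n^{(3)}, \qquad (16)$$
where $T_n^{(2)}$ and $T_n^{(3)}$ are defined as follows:
$$T_n^{(p)} := \frac{1}{n^4}\sum_{i,j,q,r} |k_{ij}| \left( l_{ij}^{(p)} + l_{qr}^{(p)} + l_{iq}^{(p)} \right), \quad \text{for } p = 2, 3,$$
with
$$l_{ij}^{(2)} = |\mathrm{Hess}(l)(\epsilon_i, \epsilon_j)|_\infty \left( |g(X_i)|_\infty^2 + |g(X_j)|_\infty^2 \right), \qquad l_{ij}^{(3)} = |g(X_i)|_\infty^3 + |g(X_j)|_\infty^3.$$
Note that $T_n^{(2)}$ and $T_n^{(3)}$ are V-statistics whose kernels are integrable by (M'.c-iii) and (M'.c-iv). Consequently, the WLLN of V-statistics holds for $T_n^{(2)}$ and $T_n^{(3)}$. Now since $\sqrt{n}\,|\hat\beta_n - \beta_0|_\infty = O_P(1)$, it follows that (16) $\xrightarrow{P} 0$ and the claim is established.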

Thus it remains to find the limiting distribution of $T_n^{(0)} + (\hat\beta_n - \beta_0)^\top T_n^{(1)}$. To do that, first we will show that $X$ and $\epsilon$ are not independent under each of $H_1$, $H_2$ and $H_3$, and hence $\theta(X, \epsilon) > 0$, where $\theta(X, \epsilon)$ is as defined in (5). Under hypothesis $H_1$, $X$ and $\eta$ are not independent and $\epsilon = \eta$. Hence $X$ and $\epsilon$ are not independent under $H_1$. For the cases $H_2$ and $H_3$ we proceed as follows. The conditional mean of $\epsilon$ given $X$ is
$$\mathbb{E}[\epsilon \mid X] = m(X) - g(X)^\top \beta_0 + \mathbb{E}[\eta \mid X] = m(X) - g(X)^\top \beta_0.$$
Note that under $H_2$ or $H_3$, $m(X) \ne g(X)^\top \beta_0$ with positive probability. In the case when $m(X) - g(X)^\top \beta_0$ is a non-constant function of $X$, we have that $\mathbb{E}[\epsilon \mid X]$ depends on $X$, and hence $X$ and $\epsilon$ are not independent. The case $m(X) = g(X)^\top \beta_0 + c$ for some non-zero constant $c$ does not arise for $H_2$ by the assumption in Theorem 3.2. On the other hand, under $H_3$, if $m(X) = g(X)^\top \beta_0 + c$, then $\epsilon = c + \eta$. Thus $\epsilon$ and $X$ are not independent.

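Letting $W_i := (X_i, \epsilon_i)$, $T_n^{(p)}$ for $p = 0, 1$ can naturally be written as a V-statistic
$$T_n^{(p)} = \frac{1}{n^4}\sum_{1 \le q,r,s,t \le n} h^{(p)}(W_q, W_r, W_s, W_t)$$
for some symmetric kernel given by
$$h^{(p)}(W_q, W_r, W_s, W_t) = \frac{1}{4!}\sum_{(t,u,v,w)}^{(q,r,s,t)} \left( k_{tu}\, l_{tu}^{(p)} + k_{tu}\, l_{vw}^{(p)} - 2 k_{tu}\, l_{tv}^{(p)} \right),$$
where the sum is being taken over all $4!$ permutations of $(q, r, s, t)$. Note that $\mathbb{E}[|h^{(0)}(W_q, W_r, W_s, W_t)|^2] < \infty$ for $1 \le q, r, s, t \le 4$ by (M'.c-i) under hypothesis $H_j$ for each $j = 1, 2, 3$. Also, $\mathbb{E}[h^{(0)}(W_1, W_2, W_3, W_4)] = \theta(X, \epsilon)$ by the definition of $\theta$. Thus, appealing to the standard theory of V-statistics, we obtain
$$\sqrt{n}\,(T_n^{(0)} - \theta(X, \epsilon)) = n^{-1/2}\sum_{i=1}^n h_1^{(0)}(W_i) + o_P(1), \qquad (17)$$
where $h_1^{(0)}(w) = \mathbb{E}[h^{(0)}(w, W_2, W_3, W_4)] - \theta(X, \epsilon)$. Note that $\mathbb{E}[h_1^{(0)}(W_1)] = 0$ and $\mathbb{E}[h_1^{(0)}(W_1)^2] \le \mathrm{Var}(h^{(0)}(W_1, W_2, W_3, W_4)) < \infty$.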
On the other hand, $\mathbb{E}[|h^{(1)}(W_q, W_r, W_s, W_t)|_\infty] < \infty$ for $1 \le q, r, s, t \le 4$ by (M'.c-ii) under hypothesis $H_j$ for each $j = 1, 2, 3$. So by the WLLN for V-statistics, we have
$$T_n^{(1)} \xrightarrow{P} \gamma := \mathbb{E}[h^{(1)}(W_1, W_2, W_3, W_4)].$$
From (14) and (17),
$$\sqrt{n}\,(T_n - \theta(X, \epsilon)) = \sqrt{n}\,(T_n^{(0)} - \theta(X, \epsilon)) + \sqrt{n}\,(\hat\beta_n - \beta_0)^\top T_n^{(1)} + o_P(1) = n^{-1/2}\sum_{i=1}^n \left[ h_1^{(0)}(W_i) + \gamma^\top A^{-1} g(X_i)\,\epsilon_i \right] + o_P(1),$$
which by the CLT has an asymptotic normal distribution with mean $0$ and variance
$$\mathrm{Var}\left( h_1^{(0)}(W_1) + \gamma^\top A^{-1} g(X_1)\,\epsilon_1 \right). \qquad (18)$$

□ 

Acknowledgments. The authors would like to thank Probal Chaudhuri, Victor 
A Appendix

A.1 Proof of the Theorem 6.1

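Observe that
$$e^*_{in} - \epsilon_{in} = -(\beta_n^* - \beta_n)^\top g(X_{in}). \qquad (19)$$
Using (19) and by Taylor's expansion,
$$l^*_{ij} = l_{ij}^{(0)} + (\beta_n^* - \beta_n)^\top l_{ij}^{(1)} + \tfrac12 (\beta_n^* - \beta_n)^\top v_{ij}\, (\beta_n^* - \beta_n), \qquad (20)$$
where
$$l_{ij}^{(0)} := l(\epsilon_{in}, \epsilon_{jn}), \qquad l_{ij}^{(1)} := -\left[ l_x(\epsilon_{in}, \epsilon_{jn})\, g(X_{in}) + l_y(\epsilon_{in}, \epsilon_{jn})\, g(X_{jn}) \right], \quad \text{and}$$
$$v_{ij} := l_{xx}(\vartheta_{ijn}, \tau_{ijn})\, g(X_{in}) g(X_{in})^\top + l_{yy}(\vartheta_{ijn}, \tau_{ijn})\, g(X_{jn}) g(X_{jn})^\top + 2\, l_{xy}(\vartheta_{ijn}, \tau_{ijn})\, g(X_{in}) g(X_{jn})^\top$$
for some point $(\vartheta_{ijn}, \tau_{ijn})$ on the straight line connecting the two points $(e^*_{in}, e^*_{jn})$ and $(\epsilon_{in}, \epsilon_{jn})$.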

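In view of (20), we can decompose $T_n^*$ in the following way:
$$T_n^* = T_n^{(0)} + (\beta_n^* - \beta_n)^\top T_n^{(1)} + \tfrac12 (\beta_n^* - \beta_n)^\top T_n^{(2)} (\beta_n^* - \beta_n) + R_n, \qquad (21)$$
where
$$T_n^{(p)} = \frac{1}{n^2}\sum_{i,j} k_{ij}\, l_{ij}^{(p)} + \frac{1}{n^4}\sum_{i,j,q,r} k_{ij}\, l_{qr}^{(p)} - \frac{2}{n^3}\sum_{i,j,q} k_{ij}\, l_{iq}^{(p)}$$
for $p \in \{0, 1, 2\}$, and
$$l_{ij}^{(2)} := l_{xx}(\epsilon_{in}, \epsilon_{jn})\, g(X_{in}) g(X_{in})^\top + l_{yy}(\epsilon_{in}, \epsilon_{jn})\, g(X_{jn}) g(X_{jn})^\top + 2\, l_{xy}(\epsilon_{in}, \epsilon_{jn})\, g(X_{in}) g(X_{jn})^\top,$$
and $R_n$ is the remainder term. Note that $l_{ij}^{(0)} \in \mathbb{R}$, $l_{ij}^{(1)} \in \mathbb{R}^d$, and $l_{ij}^{(2)} \in \mathbb{R}^{d \times d}$.

For $p \in \{0, 1, 2\}$, $T_n^{(p)}$ can be expressed as a V-statistic (but with triangular arrays) of the form
$$T_n^{(p)} = \frac{1}{n^4}\sum_{1 \le i,j,q,r \le n} h^{(p)}(Z_{in}, Z_{jn}, Z_{qn}, Z_{rn}), \qquad (22)$$
for some symmetric kernel given by
$$h^{(p)}(Z_{in}, Z_{jn}, Z_{qn}, Z_{rn}) = \frac{1}{4!}\sum_{(t,u,v,w)}^{(i,j,q,r)} \left( k_{tu}\, l_{tu}^{(p)} + k_{tu}\, l_{vw}^{(p)} - 2 k_{tu}\, l_{tv}^{(p)} \right), \qquad (23)$$
where the sum is being taken over all $4!$ permutations of $(i, j, q, r)$.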
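To make the three-sum form of $T_n^{(p)}$ concrete for $p = 0$, the following sketch evaluates it for scalar data and compares it with the algebraically equivalent doubly centered form $n^{-2}\sum_{i,j} \tilde k_{ij}\, l_{ij}$, where $\tilde k_{ij} = k_{ij} - \bar k_{i\cdot} - \bar k_{\cdot j} + \bar k_{\cdot\cdot}$. The Gaussian kernel, its bandwidth, and the function names are illustrative assumptions rather than choices made in the paper.

```python
import math
import random

def gaussian_gram(values, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel on scalar observations (assumed kernel)."""
    return [[math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2)) for b in values]
            for a in values]

def hsic_three_sums(K, L):
    """n^-2 sum k_ij l_ij + n^-4 sum k_ij sum l_qr - 2 n^-3 sum_{i,j,q} k_ij l_iq."""
    n = len(K)
    row_k = [sum(row) for row in K]
    row_l = [sum(row) for row in L]
    term1 = sum(K[i][j] * L[i][j] for i in range(n) for j in range(n)) / n ** 2
    term2 = sum(row_k) * sum(row_l) / n ** 4
    # sum_{i,j,q} k_ij l_iq factorises as sum_i (sum_j k_ij)(sum_q l_iq)
    term3 = 2 * sum(row_k[i] * row_l[i] for i in range(n)) / n ** 3
    return term1 + term2 - term3

def hsic_centered(K, L):
    """Same statistic via the doubly centered Gram matrix of K (K symmetric)."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    grand_mean = sum(row_mean) / n
    return sum((K[i][j] - row_mean[i] - row_mean[j] + grand_mean) * L[i][j]
               for i in range(n) for j in range(n)) / n ** 2

rng = random.Random(0)
x = [rng.uniform(-1, 1) for _ in range(30)]
eps = [rng.gauss(0, 1) for _ in range(30)]
K, L = gaussian_gram(x), gaussian_gram(eps)
t_sums = hsic_three_sums(K, L)
t_centered = hsic_centered(K, L)
```

The factorised third term avoids the explicit triple sum, so the whole statistic costs $O(n^2)$ operations once the Gram matrices are available.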
A.1.1 Getting rid of triangular sequence

Let $Z_i := (X_i, \epsilon_i)$ be i.i.d. from $P$. By the Skorohod representation theorem, there exists a sufficiently rich probability space $(\tilde\Omega, \tilde{\mathbb{P}})$, independent random elements $\omega_1, \omega_2, \ldots$ defined on $\tilde\Omega$ and functions $f_n, f$ with $\tilde Z_{in} := f_n(\omega_i)$, $\tilde Z_i := f(\omega_i)$ such that $\tilde Z_{in} =_d Z_{in}$, $\tilde Z_i =_d Z_i$ and
$$\tilde Z_{in} \to \tilde Z_i \quad \tilde{\mathbb{P}}\text{-a.s., as } n \to \infty.$$
Since we are only concerned about the distributional limit of $nT_n^*$, henceforth in this proof, we may assume, without loss of generality, that for each $n$, the random vectors $W_{in} := (Z_{in}, Z_i)$ are independent and for each $i$, $Z_{in} \to Z_i$ almost surely as $n \to \infty$. This argument is similar to that in [LN09].
We will start by showing that
$$A_n := n^{-1}\sum_{i=1}^n g(X_{in}) g(X_{in})^\top \xrightarrow{P} A := \mathbb{E}[g(X_1) g(X_1)^\top].$$
By assumption (C2.a), for any $1 \le p, q \le d$, $g_p(X_{1n}) g_q(X_{1n})$ are uniformly integrable. Moreover, by (C1.d), we have $g_p(X_{1n}) g_q(X_{1n}) \to_d g_p(X_1) g_q(X_1)$. Therefore, $g_p(X_{1n}) g_q(X_{1n}) \xrightarrow{L_1} g_p(X_1) g_q(X_1)$ and $\mathbb{E}[|g_p(X_1) g_q(X_1)|] < \infty$. Hence, we obtain that $n^{-1}\sum_{i=1}^n g(X_i) g(X_i)^\top \xrightarrow{P} A$ by the WLLN. Finally, observe that
$$n^{-1}\sum_{i=1}^n g(X_{in}) g(X_{in})^\top - n^{-1}\sum_{i=1}^n g(X_i) g(X_i)^\top \xrightarrow{L_1} 0,$$
as $n \to \infty$, since $g_p(X_{1n}) g_q(X_{1n}) \xrightarrow{L_1} g_p(X_1) g_q(X_1)$. This completes the proof that $A_n \xrightarrow{P} A$. As a consequence, $A_n$ is invertible (and hence $\beta_n^*$ is defined) with high probability as $n \to \infty$.

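Note that $\beta_n^*$ admits the following expansion:
$$n^{1/2}(\beta_n^* - \beta_n) = n^{-1/2} A_n^{-1}\sum_{i=1}^n g(X_{in})\,(Y_{in} - g(X_{in})^\top \beta_n) = n^{-1/2} A_n^{-1}\sum_{i=1}^n g(X_{in})\,\epsilon_{in}. \qquad (24)$$
Next we claim that
$$n^{1/2}(\beta_n^* - \beta_n) - \zeta_n \xrightarrow{P} 0, \qquad (25)$$
where $\zeta_n := n^{-1/2} A^{-1}\sum_{i=1}^n g(X_i)\,\epsilon_i$. We will first show that
$$n^{-1/2} A^{-1}\sum_{i=1}^n g(X_{in})\,\epsilon_{in} - \zeta_n \xrightarrow{L_2} 0. \qquad (26)$$
Clearly, it suffices to show that $n^{-1/2}\sum_{i=1}^n (g_p(X_{in})\epsilon_{in} - g_p(X_i)\epsilon_i) \xrightarrow{L_2} 0$ for each $1 \le p \le d$. Indeed, we can write the square of its $L_2$-norm as
$$n^{-1}\,\mathbb{E}\Big[ \sum_{i,j=1}^n (g_p(X_{in})\epsilon_{in} - g_p(X_i)\epsilon_i)(g_p(X_{jn})\epsilon_{jn} - g_p(X_j)\epsilon_j) \Big] = \mathbb{E}\left[ (g_p(X_{1n})\epsilon_{1n} - g_p(X_1)\epsilon_1)^2 \right],$$
which goes to $0$ as $n \to \infty$. The cross terms with $i \ne j$ vanish because the summands are independent across $i$ and have mean zero.

The limit is $0$ because $g_p(X_{1n})\epsilon_{1n} \to_d g_p(X_1)\epsilon_1$ and $g_p^2(X_{1n})\epsilon_{1n}^2$ is uniformly integrable by assumptions (C2.a), (C2.b) and the independence of $X_{1n}$ and $\epsilon_{1n}$. This proves (26). Recall that, from (24),
$$n^{1/2}(\beta_n^* - \beta_n) = (A_n^{-1} A) \times n^{-1/2} A^{-1}\sum_{i=1}^n g(X_{in})\,\epsilon_{in}.$$
Since by the CLT, $\zeta_n$ converges in distribution to a multivariate normal, (26) implies that $n^{-1/2} A^{-1}\sum_{i=1}^n g(X_{in})\,\epsilon_{in} = O_P(1)$. Consequently,
$$n^{1/2}(\beta_n^* - \beta_n) - n^{-1/2} A^{-1}\sum_{i=1}^n g(X_{in})\,\epsilon_{in} \xrightarrow{P} 0.$$
Now (25) follows from (26).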

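Let $V_n^{(p)}$, for $p \in \{0, 1, 2\}$, be defined analogously as $T_n^{(p)}$ in (22) but with $Z_{in} = (X_{in}, \epsilon_{in})$ replaced by $Z_i = (X_i, \epsilon_i)$. Note that $V_n^{(p)}$ is a proper V-statistic. Our next goal is to show that
$$n^{1 - p/2}\,(T_n^{(p)} - V_n^{(p)}) \xrightarrow{L_2} 0, \quad \text{for } p \in \{0, 1, 2\}. \qquad (27)$$
Observe that
$$\mathbb{E}\left[ n^{2-p}\, \mathrm{tr}\big( (T_n^{(p)} - V_n^{(p)})(T_n^{(p)} - V_n^{(p)})^\top \big) \right] = \frac{1}{n^{6+p}}\sum_{\mathbf{i}, \mathbf{j}} \mathbb{E}\left[ \mathrm{tr}\big( \tilde h^{(p)}(\mathbf{i})\, \tilde h^{(p)}(\mathbf{j})^\top \big) \right],$$
where $\mathbf{i} = (i_1, i_2, i_3, i_4)$ and $\mathbf{j} = (j_1, j_2, j_3, j_4)$ are multi-indices in $\{1, \ldots, n\}^4$, and
$$\tilde h^{(p)}(\mathbf{i}) := h^{(p)}(Z_{i_1 n}, \ldots, Z_{i_4 n}) - h^{(p)}(Z_{i_1}, \ldots, Z_{i_4}).$$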

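We will first show that $|h^{(p)}(Z_{i_1 n}, \ldots, Z_{i_4 n})|_\infty^2$ is uniformly integrable. It is enough to show that each of the terms like $|k_{rs}\, l_{tu}^{(p)}|_\infty^2$, where $r, s, t, u \in \{1, 2, 3, 4\}$ may not be necessarily distinct, is uniformly integrable. Using the independence of $X_{in}$ and $\epsilon_{in}$, we see that this follows directly from assumptions (C2.c) and (C2.d). Assumption (C1.d) and the continuous mapping theorem imply that
$$h^{(p)}(Z_{i_1 n}, \ldots, Z_{i_4 n}) \to_d h^{(p)}(Z_{i_1}, \ldots, Z_{i_4}).$$
Thus the above convergence also holds in $L_2$ and we have that
$$\mathbb{E}[|h^{(p)}(Z_{i_1}, \ldots, Z_{i_4})|_\infty^2] < \infty.$$
Consequently, $\mathbb{E}[|\tilde h^{(p)}(\mathbf{i})|_\infty^2]$ is uniformly bounded for all $\mathbf{i}$ and $n$. An application of the Cauchy-Schwarz inequality yields
$$\mathbb{E}[|\tilde h^{(p)}(\mathbf{i})|_\infty\, |\tilde h^{(p)}(\mathbf{j})|_\infty] \le \left( \mathbb{E}[|\tilde h^{(p)}(\mathbf{i})|_\infty^2] \right)^{1/2} \left( \mathbb{E}[|\tilde h^{(p)}(\mathbf{j})|_\infty^2] \right)^{1/2},$$
implying that $\mathbb{E}[|\tilde h^{(p)}(\mathbf{i})|_\infty\, |\tilde h^{(p)}(\mathbf{j})|_\infty]$ is uniformly bounded. We now make the following observations.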

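1. The number of pairs of multi-indices $\mathbf{i}$ and $\mathbf{j}$ for which $|\mathbf{i} \cup \mathbf{j}| = k$ is bounded above by a constant multiple of $n^k$, for $1 \le k \le 8$.

2. The kernel $h^{(0)}$ is degenerate of order 1, hence $\mathbb{E}[\tilde h^{(0)}(\mathbf{i})\, \tilde h^{(0)}(\mathbf{j})] = 0$ when $|\mathbf{i} \cup \mathbf{j}| = 7$ or $8$.

3. The kernel $h^{(1)}$ satisfies $\mathbb{E}[h^{(1)}(Z_{1n}, \ldots, Z_{4n})] = 0$ (we will show this in Lemma A.1), hence when $|\mathbf{i} \cup \mathbf{j}| = 8$, $\mathbb{E}[\tilde h^{(1)}(\mathbf{i})\, \tilde h^{(1)}(\mathbf{j})^\top] = \mathbb{E}[\tilde h^{(1)}(\mathbf{i})]\, \mathbb{E}[\tilde h^{(1)}(\mathbf{j})^\top] = 0$.

Putting the above observations together, it remains to prove that
$$\mathbb{E}[\mathrm{tr}(\tilde h^{(p)}(\mathbf{i})\, \tilde h^{(p)}(\mathbf{j})^\top)] \to 0 \quad \text{for any } \mathbf{i}, \mathbf{j} \text{ such that } |\mathbf{i} \cup \mathbf{j}| = 6 + p,$$
for $p \in \{0, 1, 2\}$. But this immediately follows from the fact that
$$\tilde h^{(p)}(\mathbf{i}) \xrightarrow{L_2} 0,$$
which has been already shown. Hence (27) is proved.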
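Finally, we claim that
$$n \times \Big[ T_n^{(0)} + (\beta_n^* - \beta_n)^\top T_n^{(1)} + \tfrac12 (\beta_n^* - \beta_n)^\top T_n^{(2)} (\beta_n^* - \beta_n) - V_n^{(0)} - n^{-1/2}\zeta_n^\top V_n^{(1)} - \tfrac{1}{2n}\,\zeta_n^\top V_n^{(2)} \zeta_n \Big] \xrightarrow{P} 0, \qquad (28)$$
which now easily follows from (26) and (27).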

A.1.2 Negligibility of the remainder term $R_n$

In this subsection, we will show that the remainder term can be ignored for future analysis. More precisely, we will prove that
$$nR_n \xrightarrow{P} 0. \qquad (29)$$
Let us define
$$Q_n = \frac{1}{n^2}\sum_{i,j} k_{ij}\,(v_{ij} - l_{ij}^{(2)}) + \frac{1}{n^4}\sum_{i,j,q,r} k_{ij}\,(v_{qr} - l_{qr}^{(2)}) - \frac{2}{n^3}\sum_{i,j,q} k_{ij}\,(v_{iq} - l_{iq}^{(2)}),$$
so that $R_n = \tfrac12 (\beta_n^* - \beta_n)^\top Q_n (\beta_n^* - \beta_n)$. Since by (25) $n^{1/2}(\beta_n^* - \beta_n) = O_P(1)$, it is enough to show that for each $1 \le s, t \le d$,
$$(Q_n)_{st} \xrightarrow{P} 0.$$
Note that $(Q_n)_{st}$ is the sum of three terms and each of those terms can be shown to converge to $0$ in probability. We will only spell out the details for the first term, leaving the other two terms for the reader.
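Thus we need to show that
$$\frac{1}{n^2}\sum_{i,j} k_{ij}\,(v_{ij} - l_{ij}^{(2)})_{st} \xrightarrow{P} 0. \qquad (30)$$
Note that $(v_{ij} - l_{ij}^{(2)})_{st}$ can be further broken down into three terms; the first one being $(l_{xx}(\vartheta_{ijn}, \tau_{ijn}) - l_{xx}(\epsilon_{in}, \epsilon_{jn}))\, g_s(X_{in})\, g_t(X_{in})$. The other two terms involve $l_{yy}$ and $l_{xy}$. Using the Lipschitz continuity of $l_{xx}$, $l_{yy}$ and $l_{xy}$ we obtain the following bound:
$$|(v_{ij} - l_{ij}^{(2)})_{st}| \lesssim L\,|(e^*_{in}, e^*_{jn}) - (\epsilon_{in}, \epsilon_{jn})|_\infty\, (|g(X_{in})|_\infty + |g(X_{jn})|_\infty)^2 \le dL\,|\beta_n^* - \beta_n|_\infty\, (|g(X_{in})|_\infty + |g(X_{jn})|_\infty)^3.$$
Therefore, $n^{-2}\sum_{i,j} |k_{ij}\,(v_{ij} - l_{ij}^{(2)})_{st}|$ is bounded above by
$$8dL\,|\beta_n^* - \beta_n|_\infty \cdot \frac{1}{n^2}\sum_{i,j} |k_{ij}|\,(|g(X_{in})|_\infty^3 + |g(X_{jn})|_\infty^3).$$
Observe that
$$\frac{1}{n^2}\sum_{i,j} \mathbb{E}\left[ |k_{ij}|\,(|g(X_{in})|_\infty^3 + |g(X_{jn})|_\infty^3) \right] = O(1),$$
by assumption (C2.c), and hence (30) follows. We can apply similar techniques to control the other two terms in $Q_n$. Hence, $Q_n = o_P(1)$.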

A.1.3 Finding the limiting distribution

In this subsection, we will finally prove that $nT_n^*$ converges to a non-degenerate distribution. Note that by (21), (28) and (29), it is enough to show that the random variable
$$nV_n^{(0)} + \zeta_n^\top (n^{1/2} V_n^{(1)}) + \tfrac12\, \zeta_n^\top V_n^{(2)} \zeta_n$$
converges in distribution, where $V_n^{(p)}$, for $p \in \{0, 1, 2\}$, is defined near (27). The kernel $h^{(0)}$ is degenerate of order 1, i.e., $\mathbb{E}[h^{(0)}(z_1, Z_2, Z_3, Z_4)] = 0$ a.s. Let us define
$$h_2^{(0)}(z_1, z_2) := \mathbb{E}[h^{(0)}(z_1, z_2, Z_3, Z_4)]$$
and let $S_n^{(0)}$ be the V-statistic with kernel $h_2^{(0)}$, i.e.,
$$S_n^{(0)} = \frac{1}{n^2}\sum_{i,j=1}^n h_2^{(0)}(Z_i, Z_j).$$
By the standard theory of V-statistics, the asymptotic behaviour of $nV_n^{(0)}$ is determined by that of $nS_n^{(0)}$.
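The symmetric function $h_2^{(0)}$ admits an eigenvalue decomposition
$$h_2^{(0)}(z_1, z_2) = \sum_{r=0}^\infty \lambda_r\, \varphi_r(z_1)\, \varphi_r(z_2),$$
where $(\varphi_r)_{r \ge 0}$ is an orthonormal basis of $L_2(\mathbb{R}^{d_0+1}, P)$ and $\lambda_r$ is the eigenvalue corresponding to the eigenvector $\varphi_r$. Since $h_2^{(0)}$ is degenerate of order 1, $\lambda_0 = 0$, $\varphi_0 = 1$. Therefore, $\mathbb{E}[\varphi_r(Z_1)] = 0$ for each $r \ge 1$. Also, $\sum_r \lambda_r^2 = \mathbb{E}[h_2^{(0)}(Z_1, Z_2)^2] < \infty$. We use the above decomposition of $h_2^{(0)}$ to express $S_n^{(0)}$ as follows:
$$S_n^{(0)} = \sum_{r=1}^\infty \lambda_r \Big( \frac{1}{n}\sum_{i=1}^n \varphi_r(Z_i) \Big)^2.$$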

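Let us now turn our attention to $V_n^{(1)}$. It is again a V-statistic whose kernel $h^{(1)}$ has mean zero, i.e., $\mathbb{E}[h^{(1)}(Z_1, Z_2, Z_3, Z_4)] = 0$ (see Lemma A.1). Therefore, if we define its first order projection by
$$h_1^{(1)}(z_1) := \mathbb{E}[h^{(1)}(z_1, Z_2, Z_3, Z_4)],$$
then
$$n^{1/2}\, V_n^{(1)} = n^{-1/2}\sum_{i=1}^n h_1^{(1)}(Z_i) + o_P(1).$$
On the other hand, by the WLLN for V-statistics, we have
$$V_n^{(2)} \xrightarrow{P} \mathbb{E}[h^{(2)}(Z_1, Z_2, Z_3, Z_4)] =: \Lambda \in \mathbb{R}^{d \times d}.$$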
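By the multivariate CLT, the random vectors
$$\Big( \big( n^{-1/2}\sum_{i=1}^n \varphi_r(Z_i) \big)_{r \ge 1},\; n^{-1/2}\sum_{i=1}^n h_1^{(1)}(Z_i),\; \zeta_n \Big)$$
converge in distribution to jointly Gaussian random variables
$$Z = (Z_r)_{r \ge 1}, \quad N = (N_i)_{1 \le i \le d}, \quad W = (W_i)_{1 \le i \le d},$$
where $Z_r$ are i.i.d. $N(0, 1)$, $N \sim N_d(0, \Xi)$ and $W \sim N_d(0, \sigma^2 A^{-1})$, with $\sigma^2 := \mathbb{E}[\epsilon_1^2]$ and $\Xi := \mathbb{E}[h_1^{(1)}(Z_1)\, h_1^{(1)}(Z_1)^\top]$. Also, the covariance structure between the random variables $Z_r$, $N$ and $W$ is given by
$$\mathbb{E}[Z_r N] = \mathbb{E}[\varphi_r(Z_1)\, h_1^{(1)}(Z_1)],$$
$$\mathbb{E}[Z_r W] = A^{-1}\,\mathbb{E}[g(X_1)\,\epsilon_1\, \varphi_r(Z_1)],$$
$$\mathbb{E}[W N^\top] = A^{-1}\,\mathbb{E}[\epsilon_1\, g(X_1)\, h_1^{(1)}(Z_1)^\top].$$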
Therefore, by the continuous mapping theorem,
$$nT_n^* = nV_n^{(0)} + \zeta_n^\top (n^{1/2} V_n^{(1)}) + \tfrac12\, \zeta_n^\top V_n^{(2)} \zeta_n + o_P(1) \to_d \sum_{r=1}^\infty \lambda_r Z_r^2 + \sum_{i=1}^d W_i N_i + \frac12 \sum_{i,j=1}^d W_i \Lambda_{ij} W_j, \qquad (31)$$
which concludes the proof of the theorem.

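Lemma A.1 Let $h^{(1)}$ be the symmetric kernel as defined in (23). Let $Z_1, Z_2, Z_3$ and $Z_4$ be i.i.d. random vectors with $Z_i = (X_i, \epsilon_i) \in \mathbb{R}^{d_0} \times \mathbb{R}$, where $X_i$ and $\epsilon_i$ are independent. Then
$$\mathbb{E}[h^{(1)}(Z_1, \ldots, Z_4)] = 0.$$

Proof: We have
$$h^{(1)}(Z_1, Z_2, Z_3, Z_4) = \frac{1}{4!}\sum_{(t,u,v,w)}^{(1,2,3,4)} \left( k_{tu}\, l_{tu}^{(1)} + k_{tu}\, l_{vw}^{(1)} - 2 k_{tu}\, l_{tv}^{(1)} \right),$$
where the sum is being taken over all $4!$ permutations of $(1, 2, 3, 4)$. We claim that $\mathbb{E}[k_{tu}\, l_{tu}^{(1)} + k_{tu}\, l_{vw}^{(1)} - 2 k_{tu}\, l_{tv}^{(1)}] = 0$ for each such permutation, from which the lemma would follow immediately. Recall that
$$l_{ij}^{(1)} = -l_x(\epsilon_i, \epsilon_j)\, g(X_i) - l_y(\epsilon_i, \epsilon_j)\, g(X_j) =: Q_{ij} + R_{ij} \quad \text{(say)}.$$
Using the independence of $X_i$ and $\epsilon_i$, we obtain that
$$\mathbb{E}[k_{tu} Q_{tu} + k_{tu} Q_{vw} - 2 k_{tu} Q_{tv}]$$
$$= -\mathbb{E}[k(X_t, X_u)\, g(X_t)]\,\mathbb{E}[l_x(\epsilon_t, \epsilon_u)] - \mathbb{E}[k(X_t, X_u)\, g(X_v)]\,\mathbb{E}[l_x(\epsilon_v, \epsilon_w)] + 2\,\mathbb{E}[k(X_t, X_u)\, g(X_t)]\,\mathbb{E}[l_x(\epsilon_t, \epsilon_v)]$$
$$= \mathbb{E}[l_x(\epsilon_1, \epsilon_2)] \left( \mathbb{E}[k(X_1, X_2)\, g(X_1)] - \mathbb{E}[k(X_1, X_2)\, g(X_3)] \right).$$
Similarly,
$$\mathbb{E}[k_{tu} R_{tu} + k_{tu} R_{vw} - 2 k_{tu} R_{tv}]$$
$$= -\mathbb{E}[k(X_t, X_u)\, g(X_u)]\,\mathbb{E}[l_y(\epsilon_t, \epsilon_u)] - \mathbb{E}[k(X_t, X_u)\, g(X_w)]\,\mathbb{E}[l_y(\epsilon_v, \epsilon_w)] + 2\,\mathbb{E}[k(X_t, X_u)\, g(X_v)]\,\mathbb{E}[l_y(\epsilon_t, \epsilon_v)]$$
$$= \mathbb{E}[l_y(\epsilon_1, \epsilon_2)] \left( \mathbb{E}[k(X_1, X_2)\, g(X_3)] - \mathbb{E}[k(X_1, X_2)\, g(X_2)] \right).$$
Since $k$ is symmetric, $\mathbb{E}[k(X_1, X_2)\, g(X_2)] = \mathbb{E}[k(X_1, X_2)\, g(X_1)]$, and since $l$ is symmetric, we have $l_x(a, b) = l_y(b, a)$, which implies that $\mathbb{E}[l_x(\epsilon_1, \epsilon_2)] = \mathbb{E}[l_y(\epsilon_1, \epsilon_2)]$. Hence,
$$\mathbb{E}[k_{tu} Q_{tu} + k_{tu} Q_{vw} - 2 k_{tu} Q_{tv}] + \mathbb{E}[k_{tu} R_{tu} + k_{tu} R_{vw} - 2 k_{tu} R_{tv}] = 0,$$
and consequently, $\mathbb{E}[k_{tu}\, l_{tu}^{(1)} + k_{tu}\, l_{vw}^{(1)} - 2 k_{tu}\, l_{tv}^{(1)}] = 0$. □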
A.1.4 The empirical distribution of the residuals

In the following lemma we gather a few standard results about the empirical distribution of the residuals for the linear regression model $Y = m(X) + \eta$.

Lemma A.2 Under the conditions (I), (M''.a) and (M''.b), the following statements hold:

(i) $\mathbb{P}_n[|e^o - \epsilon^o|^r] \xrightarrow{a.s.} 0$ for each $0 < r \le 4 + 2\delta$;

(ii) $P_{n,e^o} \to_d \epsilon^o$ a.s.;

(iii) $\sup_n \mathbb{P}_n[|e^o|^{2+\delta}] < \infty$ a.s.

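Proof: Note that $e_i - \epsilon_i = -g(X_i)^\top (\hat\beta_n - \beta_0)$. Thus,
$$\mathbb{P}_n[|e - \epsilon|^r] \le d^r\, |\hat\beta_n - \beta_0|_\infty^r\, \mathbb{P}_n[|g(X)|_\infty^r].$$
Hence, $\mathbb{P}_n[|e - \epsilon|^r] \xrightarrow{a.s.} 0$, using the facts that $\mathbb{E}[|g(X)|_\infty^{4+2\delta}] < \infty$ by (M''.a) and that $\hat\beta_n \xrightarrow{a.s.} \beta_0$ by (14) and $\mathbb{E}[|g(X)\epsilon|] < \infty$, the latter being guaranteed by (M''.a) and (M''.b). Next we obtain
$$\bar e = \mathbb{P}_n[e] = \mathbb{P}_n[m(X)] + \mathbb{P}_n[\eta] - \mathbb{P}_n[g(X)]^\top \hat\beta_n \xrightarrow{a.s.} \mathbb{E}[m(X)] - \mathbb{E}[g(X)]^\top \beta_0 = \mathbb{E}[\epsilon],$$
since $\mathbb{P}_n[\eta] \xrightarrow{a.s.} \mathbb{E}[\eta] = 0$. Because $e_i^o - \epsilon_i^o = (e_i - \epsilon_i) - (\bar e - \mathbb{E}[\epsilon])$, this completes the proof of (i).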

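Letting $P_{n,\epsilon^o}$ be the empirical measure of $\epsilon_1^o, \epsilon_2^o, \ldots, \epsilon_n^o$, we first observe that
$$\int e^{i\xi x}\, dP_{n,e^o}(x) = e^{-i\xi \bar e}\, \mathbb{P}_n[e^{i\xi e}],$$
and hence, for any $\xi \in \mathbb{R}$, we have
$$\left| \int e^{i\xi x}\, dP_{n,e^o}(x) - e^{-i\xi(\bar e - \mathbb{E}[\epsilon])} \int e^{i\xi x}\, dP_{n,\epsilon^o}(x) \right| = \left| \mathbb{P}_n[e^{i\xi e}] - \mathbb{P}_n[e^{i\xi \epsilon}] \right| \le |\xi|\, \mathbb{P}_n[|e - \epsilon|] \xrightarrow{a.s.} 0,$$
by applying part (i) with $r = 1$. Now since $P_{n,\epsilon^o} \to_d \epsilon^o$ a.s. by the Glivenko-Cantelli lemma and $\bar e \to \mathbb{E}[\epsilon]$ as shown in part (i) of the lemma, we have $\int e^{i\xi x}\, dP_{n,e^o}(x) \xrightarrow{a.s.} \int e^{i\xi x}\, dP_{\epsilon^o}(x)$, which, by Lévy's continuity theorem, yields (ii).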
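To prove (iii), we write
$$\mathbb{P}_n[|e^o|^{2+\delta}] = \mathbb{P}_n[|e - \bar e|^{2+\delta}] = \mathbb{P}_n[|(e - \epsilon) + (\epsilon - \mathbb{E}[\epsilon]) - (\bar e - \mathbb{E}[\epsilon])|^{2+\delta}]$$
$$\le 3^{2+\delta} \left( \mathbb{P}_n[|e - \epsilon|^{2+\delta}] + \mathbb{P}_n[|\epsilon^o|^{2+\delta}] + |\bar e - \mathbb{E}[\epsilon]|^{2+\delta} \right).$$
The result is then an immediate consequence of the fact that $\mathbb{P}_n[|\epsilon^o|^{2+\delta}] \xrightarrow{a.s.} \mathbb{E}[|\epsilon^o|^{2+\delta}] < \infty$ by (M''.a) and (M''.b), that $\bar e \xrightarrow{a.s.} \mathbb{E}[\epsilon]$, and part (i) of the lemma. □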
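Part (i) of Lemma A.2 with $r = 1$ can be illustrated by simulation. The sketch below computes $\mathbb{P}_n[|e^o - \epsilon^o|]$ for a correctly specified simple linear model, so that $\epsilon = \eta$ and $\mathbb{E}[\epsilon] = 0$; the model, the sample sizes and the function name `mean_abs_gap` are assumptions chosen for illustration only.

```python
import random

def mean_abs_gap(n, seed=0):
    """P_n[|e^o - eps^o|] for Y = 1 + 2X + eps, with eps ~ N(0, 1)."""
    rng = random.Random(seed)
    x = [rng.uniform(0, 1) for _ in range(n)]
    eps = [rng.gauss(0, 1) for _ in range(n)]
    y = [1 + 2 * a + b for a, b in zip(x, eps)]
    # Least squares fit with g(x) = (1, x).
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    intercept = my - slope * mx
    e = [b - intercept - slope * a for a, b in zip(x, y)]
    e_bar = sum(e) / n
    # E[eps] = 0 by construction, so the centered error eps^o equals eps.
    return sum(abs((r - e_bar) - v) for r, v in zip(e, eps)) / n

gap_small = mean_abs_gap(100)
gap_large = mean_abs_gap(10000)
```

The gap is driven by $|\hat\beta_n - \beta_0|$, so it is expected to shrink at roughly the $n^{-1/2}$ rate as the sample size grows.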